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Abstract 


Our  thesis  is  that  a  geometric  perspective  yields  insights  into  the  structure  of  funda¬ 
mental  problems,  and  thereby  suggests  efficient  algorithms  for  them.  As  evidence,  we 
develop  new  geometric  models  and  general-purpose  tools  for  removing  outliers  from 
numeric  data,  reducing  dimensionality,  and  counting  combinatorial  sets.  Then  we  ap¬ 
ply  these  techniques  to  a  set  of  old  problems  to  obtain  polynomial-time  algorithms. 
These  include:  (1)  learning  noisy  linear-threshold  functions  (half-spaces),  (2)  learning 
the  intersection  of  half-spaces,  (3)  clustering  text  corpora,  and  (4)  counting  lattice 
points  in  a  convex  body.  We  supplement  some  of  our  theorems  with  experimental 
studies. 
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Chapter  1 
Introduction 


In  algorithm  discovery,  a  geometric  point  of  view  is  often  an  insightful  one.  A  won¬ 
derful  example  of  this  is  Linear  Programming  (LP).  Algorithms  for  LP  such  as  the 
Simplex  method,  the  Ellipsoid  method  and  Interior  point  methods  can  all  be  pre¬ 
sented  and  explained  in  purely  algebraic  terms.  However,  the  ideas  and  intuition 
behind  them  become  particularly  transparent  when  viewed  geometrically.  We  illus¬ 
trate  this  in  more  detail.  Consider  the  following  general  linear  program: 

max  CiXi  -f  C2X2  -f  .  .  .  -f-  CnXn 
subject  to  the  constraints: 

-|-  012X2  -f-  .  .  .  -f  OinX-n  <  61 
<^21X1  -f  022X2  +  •  •  •  <X2nXn  E:  ^2 

^mlXi  ^7712X2  ^TTlnXyi  ^  l^TTl 

This  is  a  linear  program  in  n  variables,  a;i,  0:2, . . . ,  a;„,  so  the  problem  is  in  i?". 
The  Cj’s,  ttij’s  and  b^s  are  specified  real  numbers.  Each  linear  constraint  corresponds 
to  a  half-space  (one  side  of  a  hyperplane)  in  il”.  Hence  their  intersection,  the  feasible 
region,  is  a  polyhedron  (the  higher-dimensional  analogue  of  a  polygon).  The  objective 
function  which  we  are  trying  to  maximize  corresponds  to  a  direction,  and  its  value 
of  at  a  point  x  is  simply  (a  scaling  of)  its  distance  from  the  origin  in  that  direction. 
From  this  perspective  it  is  intuitively  clear  that  the  maximum  will  be  achieved  at  a 
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point  that  is  farthest  away  from  the  origin  in  the  direction  specified  by  the  objective 
function.  Further,  since  the  feasible  region  is  a  polyhedron,  the  maximum  is  achieved 
at  some  corner  or  facet  (if  there  is  a  finite  maximum)  of  the  polyhedron.  The  Simplex 
method  is  then  a  very  natural  one:  start  at  some  vertex  of  polyhedron^  and  move  to 
an  adjacent  vertex  that  improves  the  value  of  the  objective  function,  i.e.,  is  farther 
away  in  that  direction  till  this  is  not  possible  and  then  declare  that  point  the  maxi¬ 
mum!  With  a  little  more  work  the  Ellipsoid  and  Interior-point  methods  can  also  be 
explained  in  a  similar  fashion. 

The  models  and  methods  presented  in  this  thesis  are  all  motivated  from  a  geo¬ 
metric  perspective.  In  some  cases,  the  original  statement  of  the  problem  is  not  in 
geometric  terms,  yet  recasting  it  in  such  terms  helps  us  find  efficient  algorithms. 
In  some  case  we  derive  the  first  polynomial-time  algorithms,  in  other  cases  where 
polynomial  algorithms  were  already  known,  we  improve  their  efficiency. 


1.1  Overview  of  new  results 

Outliers.  The  first  scenario  we  consider  is  a  rather  general  one.  We  are  presented 
with  a  set  of  points.  Each  point  has  a  fixed  set  of  numeric  attributes.  This  data 
could  be  the  input  to  an  algorithm,  e.g.,  the  training  set  of  a  learning  algorithm.  It  is 
possible  that  such  a  data  set  has  outliers.  Typically,  this  might  be  due  to  some  error 
in  collecting  the  data  etc.,  or  it  might  actually  correspond  to  an  interesting  pattern. 
In  any  case  a  useful  thing  to  do  would  be  to  find  (and  separate)  outliers  in  the  data. 

We  address  this  situation  by  first  posing  the  question:  what  precisely  is  an  outlier? 
At  first  sight,  our  definition  might  seem  rather  strong.  We  call  a  point  an  outlier  (with 
respect  to  a  given  set  of  points)  if  there  exists  any  direction  in  which  the  squared 
distance  of  the  point  from  the  origin  is  more  than  a  fixed  ratio  times  the  average 
squared  distance  of  the  data  set  in  that  direction  [12]. 

Given  this  definition,  two  questions  arise:  (1)  Is  it  possible  to  quickly  detect  such 
outliers?  (2)  Is  it  possible  to  remove  a  small  subset  of  the  points  so  there  are  no 
outliers  left?  We  are  able  give  a  polynomial-time  algorithm  for  detecting  outliers  in 
n-dimensional  data,  i.e.,  points  in  and  show  that  for  a  reasonable  ratio  of  outlier 
to  average,  the  number  of  outliers  is  at  most  a  small  fraction  of  the  total  number  of 
points.  Formally, 

Lemma  1  For  any  set  of  points  in  ,  each  given  to  b  bits  of  precision^  in  polynomial 


^This  is  typically  achieved  by  using  additional  slack  and  surplus  variables;  we  omit  the  details. 
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time  it  is  possible  to  remove  less  than  ^  fraction  of  the  set,  so  that  in  the  remaining 
set  T ,  for  every  vector  v  6  i?", 

1 

jrf 

i.e.,  the  maximum  squared  distance  in  any  direction  is  at  most  a  polynomially  bounded 
number  times  the  average. 

Learning  noisy  perceptrons.  As  an  application  of  the  lemma,  we  consider  the 
classical  problem  of  learning  a  linear  threshold  function  (a  half-space  in  n  dimensions, 
also  called  a  “perceptron”).  Methods  for  solving  this  problem  generally  fall  into  two 
categories.  In  the  absence  of  noise,  this  problem  can  be  formulated  as  a  Linear 
Program  and  solved  in  polynomial  time.  Alternatively,  simple  greedy  algorithms 
such  as  the  Perceptron  Algorithm  [50]  are  often  used  in  practice  and  have  certain 
provable  noise-tolerance  properties;  but,  their  running  time  depends  on  a  separation 
parameter,  which  quantifies  the  amount  of  “wiggle  room”  available  for  a  solution, 
and  can  be  exponential  in  the  description  length  of  the  input. 

We  show  that  after  removing  outliers  from  the  training  data,  a  simple  modifica¬ 
tion  of  the  Perceptron  Algorithm  finds  a  weak  hypothesis  in  polynomial  time  without 
dependence  on  any  separation  parameter.  Suitably  combining  these  hypotheses  re¬ 
sults  in  a  polynomial-time  algorithm  for  learning  linear  threshold  functions  even  in 
the  presence  of  random  classification  noise,  i.e.,  when  the  labels  of  examples  pre¬ 
sented  to  us  are  inverted  with  some  fixed  noise  probability.  The  chapter  includes 
some  experimental  studies  and  lists  other  potential  applications. 

Random  Projection.  The  next  scenario  we  examine  is  one  where,  as  above, 
the  data  points  presented  to  us  are  in  a  considerably  high  dimensional  space,  i.e., 
they  have  a  large  number  of  attributes.  What  we  would  like  to  do  now,  however, 
is  to  represent  the  points  in  a  suitable  lower  dimensional  subspace.  Of  course,  what 
constitutes  a  suitable  subspace  depends  on  the  specific  application  in  mind.  We  show 
that  a  very  simple  idea  —  project  the  points  to  a  random  lower  dimensional  subspace, 
i.e.,  a  random  hyperplane  through  the  origin  —  is  very  useful  in  identifying  a  good 
subspace  quickly.  We  analyze  this  technique  as  applied  to  two  different  problems: 
learning  the  intersection  of  half-spaces,  an  old  problem,  and  clustering  text  corpora, 
a  relatively  new  problem.  Below  we  discuss  these  examples  in  some  detail. 

Learning  the  intersection  of  half-spaces.  The  first  examples  is  again  from 
learning  theory.  An  excellent  illustration  of  the  complexity  of  learning  is  Blum  and 
Rivest’s  result  about  training  a  3-node  neural  network:  we  are  given  points  in  n- 
dimensional  space  each  colored  with  one  of  two  colors,  red  and  blue.  It  is  known  that 


xeT 


rmix{v  ■  x)'^  <  poly{n,b) 


6 


INTRODUCTION 


the  red  points  can  be  separated  from  the  blue  points  using  two  half-spaces.  Finding 
these  half-spaces,  equivalent  to  training  a  3-node  neural  network,  is  NP-hard  [14].  In 
spite  of  this  apparent  hardness,  it  cannot  be  ignored  that  the  intersection  of  2  or,  more 
generally,  k  half-spaces,  is  a  natural  generalization  of  a  perceptron  that  approximates 
a  simple  neural  network  used  in  many  machine  learning  applications. 

How  do  we  get  around  the  difficulty?  We  restrict  the  distributions  from  which 
labeled  examples  are  drawn.  Under  the  assumption  that  the  distribution  from  which 
points  are  presented  is  “non-concentrated”^,  we  present  a  simple  algorithm  to  learn 
the  intersection  of  k  half-spaces  in  IC  [60].  Generalizing  all  previous  algorithms  for 
the  problem,  our  approach  works  for  k  up  to  log  n/ log  log  n  in  polynomial-time.  In 
addition  it  explicitly  finds  a  set  of  0{k)  planes  which  agree  on  1  -  e  of  the  distribution 
with  the  true  set  of  k  planes  (with  high  probability).  Our  algorithm  is  inspired  by 
the  the  following  observation.  The  true  complexity  of  the  problem  is  determined 
not  by  the  dimension  n  or  the  number  of  half-spaces  k,  but  by  the  dimension  of  the 
subspace  spanned  by  their  normal  vectors  to  the  defining  hyperplanes  (the  “relevant” 
subspace).  The  key  step  is  a  projection  to  random  lines  (1-dimensional  subspaces)  to 
identify  the  relevant  subspace. 

Fast  Latent  Semantic  Indexing.  Our  second  example  is  drawn  from  the 
burgeoning  field  of  Information  Retrieval. 

Latent  Semantic  Indexing  (LSI)  [19]  is  an  information  retrieval  method  which 
attempts  to  capture  the  hidden  structure  in  a  corpus  of  documents  by  using  techniques 
from  linear  algebra.  Vectors  representing  the  documents  are  projected  in  a  new,  low¬ 
dimensional  space  obtained  by  singular  value  decomposition  of  the  term-document 
matrix  A.  This  low-dimensional  space  is  spanned  by  the  eigenvectors  of  A^A  that 
correspond  to  the  few  largest  eigenvalues  —  and  thus,  presumably,  to  the  few  most 
striking  correlations  between  terms  Queries  are  also  projected  and  processed  in  this 
low-dimensional  space.  This  results  not  only  in  great  savings  in  storage  and  query  time 
(at  the  expense  of  some  considerable  preprocessing),  but  also,  according  to  empirical 
evidence  reported  in  the  literature,  to  improved  information  retrieval  [10,  22,  23]. 
Indeed,  it  has  been  repeatedly  reported  that  LSI  outperforms,  with  regard  to  precision 
and  recall  in  standard  collections  and  query  workloads,  more  conventional  vector- 
based  methods. 

We  use  a  probabilistic  corpus  model  and  probabilistic  analysis  to  prove  rigorously 
that,  under  certain  conditions,  LSI  succeeds  in  capturing  the  underlying  semantics  of 

^The  probability  density  is  polynomially  bounded  from  above  and  inverse  polynomially  from 
below. 
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the  corpus  and  achieves  improved  retrieval  performance. 

Then,  we  propose  the  technique  of  random  projection  as  a  way  of  speeding  up 
latent  semantic  indexing.  This  idea  yield  an  interesting  improvement  on  LSI:  we  can 
perform  the  LSI  precomputation  not  on  the  original  term-document  matrix,  but  on 
a  low-dimensional  projection,  at  great  computational  savings  and  no  great  loss  of 
accuracy. 

The  last  result  can  be  seen  as  an  alternative  to  (and  to  some  extent,  a  justification 
of)  sampling  in  LSI.  Reports  on  LSI  experiments  in  the  literature  seem  to  suggest 
that  LSI  is  often  done  not  on  the  entire  corpus,  but  on  a  randomly  selected  subcorpus 
(both  terms  and  documents  may  be  sampled,  although  it  appears  that  most  often 
documents  are).  There  is  very  little  non-empirical  evidence  of  the  accuracy  of  such 
an  approach.  Our  result  suggests  a  different  and  somewhat  more  elaborate  approach 
—  projection  on  a  random  low-dimensional  subspace  —  which  can  be  proved  to  be 
accurate.  We  supplement  some  of  our  theorems  with  experiments  on  corpora  derived 
from  our  statistical  model. 

Sampling  lattice  points.  An  old  question  in  mathematics  concerns  the  size  of 
a  convex  body:  how  to  compute  its  volume  ? 

One  could  imagine  placing  the  body  in  a  grid  of  equally  spaced  points  and  then 
counting  up  the  number  of  points  that  were  inside  it.  Intuitively,  this  should  approx¬ 
imate  the  volume  of  the  body.  More  than  150  years  ago,  the  mathematician  Gauss 
turned  this  into  a  precise  question:  Exactly  when  does  the  volume  of  a  convex  body 
approximate  the  number  of  (unit-spaced)  lattice  points  inside  it? 

Using  a  celebrated  algorithm  of  Dyer,  Frieze  and  Kannan  [24],  we  derive  a  sufficient 
condition  in  answer  to  Gauss’  question:  roughly  speaking,  if  the  body  contains  a  ball 
of  radius  at  least  as  large  as  the  dimension  of  the  space,  then  the  volume  and  number 
of  lattice  points  are  within  a  constant  factor  of  each  other  (In  fact,  this  condition  is 
tight)  [40]. 

From  this  general  condition,  we  are  able  to  derive  polynomial-time  sampling  (and 
counting)  algorithms  for  various  special  cases  of  the  problem,  such  as  contingency 
tables.,  multi-dimensional  knapsack  problems,  and  integral  flows. 


1.2  Organization  of  this  dissertation 

In  the  rest  of  this  chapter  we  give  some  basic  mathematical  background.  In  chapter  2 
we  define  outliers,  show  how  to  remove  them,  and  apply  it  learning  perceptrons  and 
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noisy  perceptrons. 

In  chapters  3  and  4  we  apply  random  projection  in  two  different  scenarios,  first 
to  the  intersection  of  half-spaces,  then  to  quickly  approximating  the  eigenspace  of  a 
matrix. 

In  chapter  5  we  describe  an  algorithm  to  sample  lattice  points  in  a  convex  body 
and  give  some  applications. 

Chapters  2-5  can  be  read  independently  of  each  other.  The  background  section 
ahead  might  be  a  useful  reference  for  all  of  them. 


1.3  Mathematical  background 

We  recollect  some  basic  definitions  and  well-known  theorems  from  probability,  geom¬ 
etry,  algebra,  and  learning  theory.  The  material  in  this  chapter  is  not  meant  to  be 
comprehensive.  It  concentrates  on  results  that  we  will  employ  in  the  chapters  ahead. 

Probability 

Let  Xi,X2, . . . , Xn  be  independent  random  variables  with  finite  expectations  and 
variances. 

Let 


S  =  Xr+...TXn,  X  =  S/n, 

H  =  E[A^  =  Ej.S'/n],  =  nvar(A)  =  (varfi'j/n. 

Then  the  following  upper  bounds  can  be  placed  on  the  probability  that  the  sum  of 
the  random  variables  deviates  from  its  expectation.  The  first  inequality  below  is  the 
Bienayme-Chebyshev  inequ'ality  and  the  latter  two  are  Hoeffding’s  inequalities  [35]. 

Prl|X-^|><]  <  N  (1.1) 

Assuming  that  for  alH,  0  <  a;-  <  1, 


Pr[X-ii>t]  <  (1.2) 

An  extension  of  the  previous  bound  where  we  assume  that  for  each  ai  <  Xi  <  bi^ 


(1.3) 


1.3.  MATHEMATICAL  BACKGROUND 
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Geometry 

Bn  refers  to  the  ball  of  unit  radius  in  R^. 

The  volume  of  Bn  is 

2r”7r”/2 

nV{nl2) 

The  volume  of  a  cone  in  il"  of  height  h  and  base  radius  r  is 

vol5„_ir”h 

n 

Linear  algebra 

Given  a  real  square  matrix  A  with  n  rows  and  n  columns,  a  vector  v  G  is  an 
eigenvector  of  A  with  eigenvalue  X  if 

Av  =  Xv. 

For  more  on  the  theory  of  eigenspaces  we  refer  the  reader  to  [32,  61]. 

Learning  theory 

We  recall  two  basic  definitions  in  learning  theory.  One  is  Valiant’s  notion  of 
Probably  Approximately  Correct  learning  (PAG  learning)  [57]. 

In  the  PAC-model  we  assume  that  examples  are  being  provided  from  some  fixed 
(but  possibly  unknown)  distribution.  Given  an  example  distribution  D,  the  error  of 
a  hypothesis  h  with  respect  to  a  target  concept  c  is  Proba;gjr>[h(a;)  /  c(rr)]. 

In  the  definition  below,  n  is  the  size  of  a  single  example. 

An  algorithm  A  PAC-learns  a  concept  class  C  by  hypothesis  H  if,  for  any  c  £  C, 
any  distribution  D  over  the  instance  space,  any  e,  5  >  0,  and  for  some  polynomial 
p,  the  following  is  true.  Algorithm  A  with  access  to  labelled  examples  of  c  drawn 
from  distribution  D  produces  with  probability  at  least  1  —  (J,  a  hypothesis  h  e  H 
with  error  at  most  c.  In  addition,  A  should  do  so  after  running  for  time  at  most 
p(n,  le,  15,  size{c))  (this  trivially  puts  the  same  upper  bound  on  the  number  of  exam¬ 
ples  seen  by  the  algorithm). 

The  second  important  definition  is  the  VC-dimension  of  a  concept  class  [59]. 

For  this  we  say  that  a  set  of  points  S  is  shattered  by  a  concept  class  C  if  there  are 
concepts  in  C  that  partition  S  in  all  of  the  2l'^l  possible  ways,  i.e.,  all  possible  ways 
of  classifying  S  are  achievable  using  concepts  in  C . 

Then  the  VC-dimension  of  a  concept  class  is  the  size  of  the  largest  set  of  points 
that  can  be  shattered  by  C. 
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INTRODUCTION 


Let  a  hypothesis  be  a  bad  hypothesis  if  its  error  is  more  than  e.  Then  the  following 
nice  theorem  places  an  upper  bound  on  the  number  of  examples  required  for  uniform 
convergence  (i.e.,  till  all  bad  hypotheses  see  a  wrong  example). 


m  =  0{-{VCdim{C)  log(-)  +  log  i)). 
c  e  o 


Theorem  1  [59] 


Chapter  2 

Removing  Outliers  from  Numeric 
Data 


IVe  present  a  robust  notion  of  outliers  in  numeric  data,  and  a  polynomial-time  algo¬ 
rithm  to  remove  a  small  fraction  of  any  set  of  points  so  that  the  remaining  set  has  no 
such  outliers. 


2.1  Introduction 

The  term  ^‘outlier”  is  a  faitiiliar  one  in  many  contexts.  Statisticians  have  several  no¬ 
tions  of  outliers.  Typically  these  notions  quantify  how  far  the  outlier  is  from  other 
values,  e.g.,  the  difference  between  the  outlier  and  the  mean  of  all  points,  the  dif¬ 
ference  between  the  outlier  and  the  mean  of  the  remaining  values,  or  the  difference 
between  the  outlier  and  the  next  closest  value.  In  addition  this  difference  might  be 
normalized  by  some  measure  of  the  ^'scatter”  of  the  set,  such  as  the  range  or  the 
standard  deviation.  Points  that  are  outside  some  cut-off  are  labelled  outliers. 

One  possible  cause  for  the  presence  of  outliers  is  experimental  error.  In  this  case, 
of  course,  it  is  desirable  to  detect  and  remove  them.  Even  if  this  is  not  the  case, 
removing  outliers  often  gives  a  much  clearer  or  simpler  explanation  for  the  remaining 
set.  Outliers  in  the  data  given  to  a  computer  program  could  affect  its  performance. 
Conceivably  they  could  slow  down  or  even  mislead  an  algorithm.  Machine  learning 
algorithms  are  an  example  where  outliers  in  the  training  data  might  lead  the  algorithm 
to  find  a  wayward  hypothesis. 

How  does  one  detect  outliers?  Simple  heuristics  for  this  are  based  on  defining 
them  as  above.  In  the  univariate  or  bivariate  cases  (  I-dimensional  or  2-dimensional) 


II 
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one  could  simply  plot  the  points,  and  visually  decide  which  ones  are  astray,  perhaps 
because  they  are  too  deviant  in  one  of  the  coordinates.  In  general,  i.e.,  in  the  multi¬ 
variate  case,  this  need  not  be  true.  An  outlier  could  be  far  away  from  the  rest  of  the 
points  without  being  so  in  any  one  coordinate. 

We  develop  a  robust  definition  of  outliers  for  a  point  set  (or  a  probability  distri¬ 
bution)  in  n-dimensional  space,  i?”.  Roughly  speaking,  our  notion  is  that  a  point  is 
an  outlier  if  it  deviates  by  some  prescribed  amount  from  the  average  in  any  direction 
(not  just  one  of  the  coordinate  axis  directions). 

Given  this  definition,  two  questions  arise:  (1)  Is  it  possible  to  quickly  detect  such 
outliers?  (2)  Is  it  possible  to  remove  a  small  subset  of  the  points  so  there  are  no 
outliers  left? 

The  second  question  is  related  to  the  following  concern.  Suppose  we  find  the 
outliers  and  remove  them  from  the  data.  It  is  then  possible  that  points  that  were 
previously  not  outliers  now  become  outliers.  Can  this  happen  repeatedly,  so  that  we 
end  up  removing  most  of  the  data  set? 

In  this  chapter,  we  show  that  for  a  reasonable  ratio  of  outlier:average,  the  number 
of  outliers  is  at  most  a  small  fraction  of  the  total  number  of  points,  i.e.,  on  removing 
this  fraction  of  points  the  remaining  data  set  has  no  outliers.  We  give  a  polynomial 
time  algorithm  for  detecting  outliers  in  il”. 

This  chapter  is  organized  into  the  following  sections.  First  we  state  our  Outlier 
Removal  Lemma  precisely.  Then  we  give  an  algorithmic  proof  of  the  lemma,  i.e., 
we  show  that  it  is  indeed  possible  to  remove  a  small  number  of  points  so  that  the 
remaining  data  has  no  outliers,  and  describe  how  to  do  this  efficiently.  In  the  next 
section  we  apply  the  lemma  to  obtain  an  efficient  algorithm  for  learning  a  noisy  half¬ 
space.  The  following  section  contains  a  discussion  of  some  experiments  we  conducted 
that  suggest  that  the  technique  might  have  wide-ranging  applications.  We  conclude 
with  some  remarks  about  possible  improvements. 


2.2  The  Outlier  Removal  Lemma 

We  assume  that  all  points  are  given  to  some  b  bits  of  precision.  More  precisely,  we 
define  R  =  {p/q  :  |p|,  \q\  G  {0, 1,  2, . . . ,  2*"  —  1},  q  ^  0},  and  assume  that  our  point  set 
S  is  restricted  to  (i.e.,  R  x  ■  ■  ■  x  R).  Our  main  lemma  states  that  given  a  set  of 
data  points  in  I^,  one  can  remove  a  small  portion  and  guarantee  that  the  remainder 
contains  no  outliers  in  a  certain  well-defined  sense. 


2.2.  THE  OUTLIER  REMOVAL  LEMMA 
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Lemma  2  (Outlier  Removal  Lemma)  For  any  set  S  C  and  any  e  >  0,  there 
exists  a  subset  S'  C  S  such  that: 

(i)  |S"|  >  (1  -  £-2-"^)|6'|,  and 

(ii)  for  every  vector  w  G  BE ,  max2;£5/(t«  •  x  Y  S:  /^E3;g5/[(w  •  xY\, 

where  f3  =  0{n'^hle).  Moreover,  such  a  set  S'  can  be  computed  in  polynomial  time. 

The  algorithm  for  computing  the  set  S'  of  Lemma  2  is  quite  simple.  It  is  as  follows: 
First,  we  may  assume  that  the  matrix  X  of  points  in  S  has  rank  n;  otherwise  we 
simply  drop  to  the  subspace  spanned.  Next  we  calculate  a  symmetric  factorization 
of  XX^  as  A^  =  XX^  which  can  be  done  by  a  standard  eigenvalue/eigenvector 
computation.  Then,  we  perform  the  linear  transformation  A~^X.  The  new  set  of 
points  (or  viewed  alternatively,  the  transformed  space)  has  the  property  that,  for 
every  unit  vector  w,  Ea;es[(u;  •  a;)^]  =  1.  Then  we  remove  all  points  x  ^  S  such  that 
>  /?/144n.  If  S  now  satisfies  the  condition  of  the  theorem  we  stop.  Otherwise, 
we  repeat. 

The  difficult  issue  is  proving  that  this  algorithm  will  in  fact  halt  before  removing 
too  many  points  from  S.  To  do  this  we  show  that  at  each  iteration  as  described  above 
the  volume  of  an  associated  dual  ellipsoid  doubles.  From  our  assumption  that  points 
are  given  to  a  fixed  number  of  bits,  we  can  derive  an  absolute  upper  bound  on  the 
maximum  volume  of  the  ellipsoid,  thus  bounding  the  number  of  iterations.  Although 
the  idea  is  simple,  the  proof  involves  some  detailed  calculation.  The  reader  could 
proceed  directly  to  section  2.3  without  loss  of  continuity. 

2.2.1  An  algorithmic  proof  of  the  lemma 

For  a  set  S'  C  R”  (S  need  not  be  finite)  and  a  distribution  //on  i?”,  let 

14;(5)  =  {re  G  R"  :  E,[{w'^xf  |  x  €  S']  <  1}. 

We  will  drop  the  subscript  p.  (on  both  W  and  on  the  expectation  E)  when  the  distri¬ 
bution  is  clear  from  context.  The  key  to  our  proof  is  the  following  lemma. 

Lemma  3  Let  p  be  a  measure  on  R"  which  is  not  concentrated  on  a  subspace  of 
dimension  less  than  n  (i.e.  the  total  measure  on  any  subspace  of  dimension  less  than 
n  is  less  than  1).  Then,  for  any  0  <  a  <  l/3n,  j3  =  36n^/a  and  n  sufficiently  large, 
there  exists  an  ellipsoid  S  C  R”  such  that 
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(a)  Pr(3;  ^  S)  <  a. 

(b)  Either 

(i)  for  all  w  €  EJ'' ,  max{(tu'^x)^  ;  x  G  .S'}  <  ISE{{io^xY  |  x  G  5'),  or 

(ii)  vol{W{S))  >  2vol{W[R^)). 


Proof.  In  the  entire  proof,  probabilities  and  expectations  are  w.r.t.  the  distri¬ 
bution  p.  Let 

M  =  E(xx^) 

= 

where  A  is  symmetric,  and  non-singular  by  assumption.  Then 

E((ty^x)^)  =  vp-  Mw 

for  all  w  G  Jl".  Now  let 

E  =  {x  G  il"  :  (le^x)^  <  uA Mw,Vw  G  il”) 

=  {x  G  :  ((Awf(A-^x)y  <  \Aw\\\/w  G  il”} 

=  {x  G  i?”  :  |A-ix|  <  1}. 

Note  that  this  shows  that  E  is  an  ellipsoid.  Putting  5:  =  A~^x  we  see  that  for  any 
7  >  0, 


Pr(^x  ^  7E) 


=  Prd^l  >  7) 

<  ^Pr(|^j|  >  7/y«) 

1=1 

1=1 


by  the  Chebychev  inequality. 
But, 


E{zz^)  =  EiA-^xx^A-^) 
=  I 


and  so 

Pr(x  ^  7E)  < 


2.2.  THE  OUTLIER  REMOVAL  LEMMA 
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We  now  take  7  =  ^  S  =  'jE  and  we  see  that  (a)  of  the  lemma  is  satisfied. 

We  now  consider  two  possibilities: 

Case  (i)  For  all  w  G  il”, 

E{{w^xf  \xeS)>  ^^E{{w^xf)//3. 

In  this  case,  for  any  w  G  il",  by  the  definitions  of  E  and  S, 


max{(rt>^a:)^)  <  7^E((u;^a:)^) 

x^S 

<  fIE{{w'^xy  I  X  G  S). 
Case  (ii)  There  exists  w  G  jR"  such  that 

E{{w'^xy  I  a;  G  5)  <  •^'^E{{viF x)^) j (3 . 
Let 


Ml  =  E(a;rr^  |  a:  G  S'). 


So 


E((u;^a:)^  |  x  G  S')  =  xtF Mw. 
We  complete  the  lemma  by  showing  that 

vol(ri)  >  2vol(T), 

where 


(2.1) 


(2.2) 


T  =  W{W) 

=  {m  G  il”  :  Mw  <  1} 

and 

Ti  =  W(S) 

=  {tc  G  Jl”  :  Miw  <  1} 

It  will  be  convenient  to  show  that 

vol(2iri)  >  2vol(2lT),  (2.3) 

which  is  equivalent  to  (2.2)  because  the  linear  transformation  A  multiplies  volumes 
by  \det{A)\. 
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Note  next  that  by  substituting  v  =  Aw  we  see  that 

AT  =  {veR^:  v'^A-^MA-h^  <  1} 

=  {u  G  -R”  :  v'^v  <1} 

=  Bn, 

where  Bn  is  the  unit  ball  in  R". 

Furthermore, 

E{{w'^xf  \xeS)<{l-  a)-^E{(w'^xY) 
which  follows  from  E{{w'^xY)  >  E{{w'^xY  \  x  G  5')Pr(a;  G  S).  So, 

ATi  =  {u  G  R"  :  v^A-^MiA-^v  <  1} 

3  {u  G  R”  :  v'^A-^MA-^  <l-a} 


Also,  ATi  contains  a  vector  of  length  A  =  Indeed,  let 

Aw 


V  —  \ 


|Ar<;| 


Then,  v  G  ATi  because,  from  (2.1), 


v^A-^MiA-H^  = 


A2 


\Aw\ 


-wMiw 


^.T 


7^  T 

<  j—^—wMw^ 
\Aw\^  (5 

=  1. 


(2.4) 


Since  ATi  contains  an  n  —  1  dimensional  ball  around  the  origin  and  a  point  at  a 
distance  of  1  —  a  from  the  center  of  the  ball,  from  the  convexity  of  ATi  it  follows  then 
that  ATx  contains  a  cone  with  base  an  {n  —  l)-dimensional  ball  of  radius  1  —  a  and 
height  A. 

Thus  if  Vn  denotes  the  volume  of  R„  we  see,  using  the  bound  on  a,  that 

vol(Ari)  ^  A14_i(l-«)”-i 

vol(Ar)  -  nVn 

^  A(l-aO"-i 

>  2. 


□ 
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We  now  specialize  the  above  result  to  the  case  where  is  concentrated  on  /^.  Let 
Lo  =  {xe  4”  :  fiix)  >  2"^”^}.  So,  n{Lo)  >  1  -  >  1  -  2'”^ 

Let  /Jo  denote  the  measure  induced  on  Lq  by  ^  i.e.  fJ.o{x)  =  iJ,{x)/ fi{Lo)  for 
X  €  Lq.  We  consider  applying  the  construction  of  Lemma  3,  K  times  starting  with 
/Uq.  In  general  we  would  expect  to  construct  a  sequence  of  ellipsoids  Si  This  assumes 
Case  (bii)  always  occurs.  Let  Hi  denote  the  measure  induced  on  D  52  H  •  •  •  D  by 
jjLQ.  It  is  possible  that  is  concentrated  on  a  subspace  Vi  of  lower  dimension.  If  so, 
we  simply  work  within  Vi  from  then  on.  This  cannot  happen  more  than  n  times. 

Suppose  that  Case  (bi)  never  occurs.  Then  there  exists  a  subspace  Vk  of  dimension 
u  and  ellipsoids  5i,  ^2, . . . ,  Sr-  such  that  if  =  To  C  5i  fl  ^2  C  •  •  •  C  Sk  H  Vk  then 

(a)  dim(rA-)  =  z/. 

(b)  ij,q{Tk)  >  1  —  OiK. 

(c)  vo14W(Ta-))  >  2^'/4 

where  in  (c), 

W{Tk)  =  {tc  e  Va'  :  E((u;^a;)2  |  a;  €  Tk)  <  1}. 

Part  (c)  takes  into  account  the  doubling  of  volume  K  times,  and  restarting  each 
time  we  move  to  a  lower  dimensional  subspace  (at  most  n  times). 

The  above  is  not  possible  for  sufficiently  large  K  as  we  will  now  show  by  bounding 
the  length  of  each  w  G  Tk-  By  assumption,  Tk  contains  v  linearly  independent  vectors 
Ui,  U2, . . . ,  Ui,  G  /^.  For  any  such  set  of  vectors, 

E{{w'^xf  I  X  G  Tk)  >  '^{w'^Vi)^iJ.K{vi) 

i=l 

«  =  1 

So  if  m  G  W{Tk)  then 

<  2"”'-  (2.5) 

i=l 

Let  B  denote  the  n  x  n  matrix  Yli=i  so  that 

uF  Bw  =  ViY .  (2.6) 

i=l 

Let  B  have  eigenvalues  0  =  Ai  =  A2  =  •  •  •  =  A„_^  <  A  =  Xn-u+i  <  A„_^+2  <  •  •  •  A„. 
Let  01,02, . . . ,  be  a  corresponding  orthonormal  basis  of  eigenvectors.  Now  if  lu  G 
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Vk,w  ^  0,  then 


vP-  Bw 
w'^w 


>  A. 


(2.7) 


Since  uA Vi  ^  0  for  at  least  one  i,  we  can  apply  (2.6).  But  A  ^  0  is  a  root  of  a 
polynomial  of  degree  at  most  n  —  1  with  rational  coefficients  Ot/A  where  |a,|,  \j3i\  < 
n!2"^  By  a  simple  computation,  this  implies  that  A  >  (n!2”^)“^"  and  so  (2.5),  (2.6 
and  (2.7)  imply  that  if  m  6  W{Tk)  then 


licp  <  2‘*"*’2^"'^(n!)2™  <  2“*”'^ 


(for  b  >  log  n)  and  so 

yoI{W{Tk))  <  (2^”''')”/2. 

This  is  a  contradiction  for  K  >  Kq  =  2iMb.  We  deduce  then  that 

Theorem  2  For  any  0  <  a  <  l/3n  and  f3  =  /a  and  fx  concentrated  on  there 
exist  k  <  Kq  ellipsoids  Si  such  that  if  S  =  nf=i  Si 

(i)  jx{S)  >l-ka-  2-”^ 

(ii)  max{(u;^a;)^  :  x  e  S}  <  /3E{{w'^xy  \  x  e  S),  for  all  w  G  . 

The  previous  discussion  has  been  existential  in  nature  and  we  now  show  how  to 
make  it  constructive.  This  is  relatively  easy  for  a  finite  set  of  m  points  (i.e  p  is 
concentrated  on  the  m  points).  Now  if  we  apply  the  above  theorem  to  p  then  all  of 
the  ellipsoids  and  subspaces  are  computable  in  polynomial  time. 

One  way  to  view  the  algorithm  is  the  following.  We  wish  to  find  a  set  of  points  with 
the  property  that  in  any  direction  lo,  the  maximum  squared  value  of  the  projection 
of  points  in  that  direction  is  not  much  more  than  the  average.  If  initially  there  is 
a  direction  where  this  is  not  true,  we  apply  a  transformation  to  the  points  {A~^x^ 
above)  that  results  in  their  inertial  ellipsoid  becoming  the  unit  ball.  Then  we  drop 
all  points  outside  a  multiple  7  of  this  ellipsoid  and  repeat  on  the  smaller  set  of  points 
(with  their  original  coordinates).  This  cannot  go  on  forever  since  we  assume  that  the 
points  are  represented  by  bounded  rationals  and  an  associated  ellipsoid  is  doubling 
in  volume  at  each  iteration. 

Note  that  we  can  make  the  method  constructive  for  the  infinite  case  as  well  by 
picking  a  sample  of  points  and  applying  VC-dimension  arguments. 


2.3.  AN  APPLICATION:  LEARNING  A  NOISY  HALF-SPACE 
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2.3  An  application:  learning  a  noisy  half-space 

The  problem  of  learning  a  linear  threshold  function  is  one  of  the  oldest  problems 
studied  in  machine  learning.  Typically,  this  problem  is  solved  by  using  simple  greedy 
methods.  For  instance,  one  commonly-used  greedy  algorithm  for  this  task  is  the 
Perceptron  Algorithm  [54,  5],  described  below  in  Section  2.3.1.  These  algorithms  have 
running  times  that  depend  on  the  amount  of  “wiggle  room”  available  to  a  solution. 
In  particular,  the  Perceptron  Algorithm  has  the  following  guarantee  [50].  Given  a 
collection  of  data  points  in  il”,  each  labeled  as  positive  or  negative.,  the  algorithm 
will  find  a  vector  w  such  that  ic  •  a;  >  0  for  all  positive  points  x  and  rc  •  a;  <  0  for  all 
negative  points  x,  if  such  a  vector  exists.^  Moreover,  the  number  of  iterations  made 
by  the  algorithm  is  at  most  1  ja'^  where  a  is  a  “separation  parameter”  defined  as  the 
largest  value  such  that  for  some  vector  w*,  all  positive  x  satisfy  cos(u;*,ar)  >  cr,  and 
all  negative  x  satisfy  cos(ac*,  a;)  <  — cr,  where  cos(a,  h)  =  -j^j^  is  the  cosine  of  the  angle 
between  vectors  a  and  b. 

Unfortunately,  it  is  possible  for  the  separation  parameter  a  to  be  exponentially 
small,  and  for  the  algorithm  to  take  exponential  time,  even  if  all  the  examples  belong 
to  {0,1}".  A  classic  setting  in  which  this  can  occur  is  a  data  set  labeled  according 
to  the  function  “if  a;i  =  1  then  positive  else  if  X2  =  1  then  negative  else  if  3:3  =  1 
then  positive,  ...”.  This  function  has  a  linear  threshold  representation,  but  it  requires 
exponentially  large  weights  and  can  cause  the  Perceptron  Algorithm  to  take  expo¬ 
nential  time.  (In  practice,  though,  the  Perceptron  Algorithm  and  its  variants  tend  to 
do  fairly  well;  e.g.,  see  [7].) 

Given  this  difficulty,  one  might  propose  instead  to  use  a  polynomial-time  linear 
programming  algorithm  tq  find  the  desired  vector  w.  Each  example  provides  one 
linear  constraint  and  one  could  simply  apply  an  LP  solver  to  solve  them  [44,  41,  48]. 
In  practice,  however,  this  approach  is  less  often  used  in  machine  learning  applications. 
One  of  the  main  reasons  is  that  the  data  often  is  not  consistent  with  any  vector  w 
and  the  goal  is  simply  to  do  as  well  as  one  can.  And,  even  though  finding  a  vector 
w  that  minimizes  the  number  of  misclassified  points  is  NP-hard,  variants  on  the 
Perceptron  Algorithm  typically  do  well  in  practice[31,  6].  In  fact,  it  is  possible  to 
provide  guarantees  for  variations  on  the  Perceptron  Algorithm  in  the  presence  of 
inconsistent  data  (e.g.,  see  [15,  16,  42]^),  under  models  in  which  the  inconsistency  is 

^If  a  non-zero  threshold  is  desired,  this  can  be  achieved  by  adding  one  extra  dimension  to  the 
space. 

^The  word  “polynomial”  in  the  title  of  [15]  means  polynomial  in  the  inverse  of  the  separation 
parameter,  which  as  noted  above  can  be  exponential  in  n  even  when  points  are  chosen  from  {0, 1}". 
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produced  by  a  sufficiently  “benign”  process,  such  as  the  random  classification  noise 
model  discussed  below. 

We  are  given  access  to  examples  (points)  drawn  from  some  distribution  D  over 
i?”.  Each  example  is  labeled  as  positive  or  negative.  The  labels  on  examples  are 
determined  by  some  unknown  target  function  re*  •  x  >  0  (i.e.,  x  is  positive  if  to*  •  x  >  0 
and  is  negative  otherwise)  but  each  label  is  then  flipped  independently  with  some 
fixed  probability  rj  <  1/2  before  it  is  presented  to  the  algorithm.  77  is  called  the  noise 
rate. 

Our  goal  is  to  find  an  algorithm  that  for  any  (unknown)  distribution  D,  any 
(unknown)  target  concept  to*  •  x  >  0,  any  (unknown)  rj  <  1/2,  and  any  inputs 
e,S  >  0,  with  probability  at  least  1  —  (^  produces  a  hypothesis  whose  error  with 
respect  to  the  target  function  is  at  most  e.  The  algorithm  may  request  a  number  of 
examples  polynomial  in  n,b,  l/e,log(|),  and  and  should  run  in  time  polynomial 
in  these  parameters  as  well. 

Here  we  present  a  version  of  the  Perceptron  Algorithm  that  maintains  its  proper¬ 
ties  of  noise-tolerance,  while  providing  polynomial-time  guarantees.  Specifically,  the 
algorithm  we  present  is  guaranteed  to  provide  a  weak  hypothesis  (one  that  correctly 
classifies  noticeably  more  than  half  of  the  examples)  in  time  polynomial  in  the  descrip¬ 
tion  length  of  the  input  and  not  dependent  on  any  separation  parameter.  The  output 
produced  hy  the  algorithm  can  be  thought  of  as  a  “thick  hyperplane,”  satisfying  the 
following  two  properties: 

1.  Points  outside  of  this  thick  hyperplane  are  classified  with  high  accuracy  (points 
inside  can  be  viewed  as  being  classified  as  “I  don’t  know”). 

2.  At  least  a  I /poly  fraction  of  the  input  distribution  lies  outside  of  this  hyperplane. 

This  sort  of  hypothesis  can  be  easily  boosted  in  a  natural  way  (by  recursively  running 
the  algorithm  on  the  input  distribution  restricted  to  the  “don’t  know”  region)  to 
achieve  a  hypothesis  of  arbitrarily  low  error. ^  This  yields  the  following  theorem. 

^Thanks  to  Rob  Schapire  for  pointing  out  that  standard  Boosting  results  [55,  29]  do  not  apply  in 
the  context  of  random  classification  noise.  (It  is  an  open  question  whether  arbitrary  weak-learning 
algorithms  can  be  boosted  in  the  random  classification  noise  model.)  Thus,  we  use  the  fact  that  the 
hypothesis  produced  by  the  algorithm  can  be  viewed  as  a  high-accuracy  hypothesis  over  a  known, 
non-negligible  portion  of  the  input  distribution.  Alternatively,  Aslam  and  Decatur  [3]  have  shown 
that  Statistical  Query  (SQ)  algorithms  can,  in  fact,  be  boosted  in  the  presence  of  noise.  Since  our 
algorithm  can  be  made  to  fit  the  SQ  framework  (see  Section  2.3.5),  we  could  also  apply  their  results 
to  achieve  strong  learning. 
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Theorem  3  The  class  of  linear  threshold  functions  in  FC  can  be  learned  in  poly¬ 
nomial  time  in  the  PAC  prediction  model  in  the  presence  of  random  classification 
noise. 

Remark:  The  learning  algorithm  can  be  made  to  fit  the  Statistical  Query  learning 
model  [42], 

The  main  idea  of  our  result  is  as  follows.  First,  we  modify  the  standard  Perceptron 
Algorithm  to  produce  an  algorithm  that  succeeds  in  weak  learning  unless  an  over¬ 
whelming  fraction  of  the  data  points  lie  on  or  very  near  to  some  hyperplane  through 
the  origin.  Specifically,  the  algorithm  succeeds  unless  there  exists  some  “bad”  vector 
w  such  that  most  of  the  data  points  x  satisfy  |cos(ic,a:)|  <  5  for  some  small  >  0. 
Thus,  we  are  done  if  we  can  somehow  preprocess  the  data  to  ensure  that  no  such  bad 
vector  w  exists.  We  apply  the  Outlier  Removal  Lemma  to  ensure  this. 

Recall  that  the  lemma  tells  us  that  given  any  set  S  of  points  in  n  dimensional 
space,  each  requiring  b  bits  of  precision,  one  can  remove  only  a  small  fraction  of  those 
points  and  then  guarantee  that  in  the  set  T  remaining,  for  every  vector  v, 

max(x  •  v)'^  <  poly{n,b)E3:^T[{^  • 

xET 

In  this  sense,  the  set  remaining  has  no  outliers  with  respect  to  any  hyperplane  through 
the  origin.  In  addition,  these  outliers  can  be  removed  in  polynomial  time.  After 
removing  the  outliers,  we  can  then  apply  a  linear  transformation  so  that  in  the  trans¬ 
formed  space,  for  every  unit  vector  i;, 

•  vy]  =  1  cLnd  max(:r  •  vY  <  po/y(n,6). 

x^T 

Because  the  maximum  is  bounded,  having  the  expectation  equal  to  1  means  that  for 
every  hyperplane  through  the  origin,  at  least  a  l/po/y(n,6)  fraction  of  the  examples 
are  at  least  a  l/poly(n,b)  distance  away,  which  then  allows  us  to  guarantee  that  the 
modified  Perceptron  Algorithm  will  be  a  weak  learner. 

2.3.1  The  Perceptron  Algorithm 

The  Perceptron  Algorithm[54,  5]  operates  on  a  set  S  of  labeled  data  points  in  n 
dimensional  space.  Its  goal  is  to  find  a  vector  w  such  that  w  •  x  >  0  for  all  positive 
points  X  and  re  •  a;  <  0  for  all  negative  points  x.  We  will  say  that  such  a  vector  w 
correctly  classifies  all  points  in  S.  If  a  non-zero  threshold  value  is  desired,  this  can 
be  handled  by  simply  creating  an  extra  (n  -f  l)st  coordinate  and  giving  all  examples 
a  value  of  1  in  that  coordinate. 
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For  convenience,  define  I{x}  (the  label  of  x)  to  be  1  if  x  is  positive  and  —1  if  x  is 
negative.  So,  our  goal  is  to  find  a  vector  w  such  that  I{x){w  •  x)  >  0  for  all  x  G  S'. 
Also,  for  a  point  x  let  x  =  x/|x|.  I.e.,  x  is  the  vector  x  normalized  to  have  length  1. 


2.3.2  The  standard  algorithm 

The  standard  algorithm  proceeds  as  follows.  We  begin  with  w  =  6.  We  then  perform 
the  following  operation  until  all  examples  are  correctly  classified: 

Pick  some  arbitrary  misclassified  example  x  G  S  and  let  w  w  +  £{x)x. 

A  classic  theorem  (see  [50])  describes  the  convergence  properties  of  this  algorithm. 

Theorem  4  [50]  Suppose  the  data  set  S  can  be  correctly  classified  by  some  unit  vector 
w* .  Then,  the  Perceptron  Algorithm  converges  in  at  most  l/cr^  iterations,  where 
cr  =  minx^s\w*  •  .x|. 

Proof.  Consider  the  cosine  of  the  angle  between  the  current  vector  w  and  the 
unit  vector  w*  given  in  the  theorem.  That  is,  .  In  each  step  of  the  algorithm, 
the  numerator  of  this  fraction  increases  by  at  least  a  because  {w  +  f(x)x)  •  w*  — 
w  ■  w*  -\-  l{x)x  •  w*  >  w  •  w*  -\-  a.  On  the  other  hand,  the  square  of  the  denominator 
increases  by  at  most  1  because  \w-\-l{x)x\^  =  \w\^ -[-2£{x){w  x)Al  <  [rcp-fl  (since  x 
was  misclassified,  this  means  the  crossterm  is  negative).  Therefore,  after  t  iterations, 
w  ■  w*  >  ta  and  |ry|  <  \/i.  Notice  that  the  former  cannot  be  larger  than  the  latter. 
Thus,  t  <  1/(7^.  □ 

2.3.3  A  modified  version 

We  now  describe  a  modified  version  of  the  Perceptron  Algorithm  that  will  be  needed 
for  our  construction.  Recall  our  notation  that  cos(a,6)  is  the  cosine  of  the  angle 
between  vectors  a  and  6,  or  equivalently  |^j^. 

The  reason  we  need  to  modify  the  algorithm  is  this:  In  the  standard  algorithm,  if 
some  of  the  points  are  far  from  the  target  plane  (in  the  sense  that  cos{w*,x)  is  large) 
and  some  are  near,  then  eventually  the  hypothesis  will  correctly  classify  the  far  away 
points  but  may  make  mistakes  on  the  nearby  ones.  This  is  simply  because  the  points 
far  from  w*  •  x  =  0  cause  the  algorithm  to  make  substantial  progress  but  the  others 
do  not.  Unfortunately,  we  cannot  test  for  points  being  far  or  near  to  the  target  plane. 
So,  we  cannot  produce  the  rule:  “if  |  cos(tt)*,x)|  is  large  then  predict  based  on  x  •  u;. 
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else  say  ‘I  don’t  know’.”  What  we  want  instead  is  an  algorithm  that  does  well  on 
points  that  are  far  fi'om  the  hypothesis  plane,  because  |cos(rc,a:)|  is  something  that 
the  algorithm  can  calculate.  If  we  then  can  guarantee  that  a  reasonable  fraction  of 
points  will  have  this  property,  we  will  have  our  desired  weak  hypothesis  (just  replacing 
w*  by  w  in  the  above  rule). 

Specifically,  our  modified  algorithm  takes  as  input  a  quantity  a  and  its  goal  is  to 
produce  a  vector  w  such  that  every  misclassified  a:  6  S'  should  satisfy  |  cos(ic,a;)|  <  a. 
The  algorithm  proceeds  as  follows. 

The  Modified  Perceptron  Algorithm 

1.  Begin  with  w  as  a  random  unit  vector. 

2.  If  every  misclassified  x  e  S  satisfies  |  cos(u;,  a;)|  <  a  (i.e.,  if  |rc  •  <  cr|u;|)  then 

halt. 

3.  Otherwise,  pick  the  misclassified  x  ^  S  maximizing  |cos(ty,a;)|  and  update  w 
using: 

w  <—  w  —  {w  ■  x)x. 

In  other  words,  we  add  to  w  the  appropriate  multiple  of  x  so  that  w  is  now 
orthogonal  to  x,  i.e.,  we  add  the  multiple  of  x  that  shrinks  w  as  much  as 
possible. 

4.  If  we  have  made  fewer  than  (l/(T^)lnn  updates  then  go  back  to  Step  2.  Other¬ 
wise,  go  back  to  Step  1  (begin  anew  with  a  new  random  unit  starting  vector). 

Theorem  5  If  the  data  set  S  is  linearly  separable,  then  with  probability  1  — ^  the  mod¬ 
ified  perceptron  algorithm  halts  after  0((l/cr^)  ln(n)  ln(|))  iterations,  and  produces  a 
vector  w  such  that  every  misclassified  x  £  S  satisfies  |  cos(rt;,  a:)|  <  a. 

Proof.  Let  w*  be  a  unit  vector  that  correctly  classifies  all  x  £  S.  Suppose  it  is  the 
case  that  the  initial  (random  unit)  vector  w  satisfies  w  ■  w*  >  1/y/n.  Notice  that  in 
each  update  made  in  Step  (3),  w  •  w*  does  not  decrease  because 

(rc  —  (u)  •  x)x)  ■  w*  =  w  ■  w*  —  [w  ■  x)[w*  ■  x)  >  w  ■  w* 

where  the  last  inequality  holds  because  w  misclassifies  x.  On  the  other  hand,  \w\‘^ 
does  decrease  significantly  because  (this  is  just  the  Pythagorean  Theorem) 

|(rc  —  (u>  •  :r)x)p  =  \w\^  —  2{w  ■  x)^ -\- {w  ■  x-Y 

<  |rc|^(l  —  a^). 
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Thus,  after  t  iterations,  |m|  <  (1  —  Since  |u;|  cannot  be  less  than  w  •  w*^  this 

means  that  the  number  of  iterations  t  satisfies  (1  —  cr^yU  >  Ijy/n,  which  implies 
t  <  (lnn)/(T^. 

Each  time  we  choose  a  random  initial  unit  vector  for  u>,  there  is  at  least  a  constant 
>  0  probability  that  w  satisfies  our  desired  condition  that  w  ■  w*  >  If y/n.  Thus,  the 
theorem  follows.  □ 

We  have  described  the  algorithm  as  one  that  runs  in  expected  polynomial  time. 
Alternatively  we  could  stop  the  algorithm  after  a  suitable  number  of  iterations  and 
have  a  high  probability  of  success.  In  Section  2.3.5  we  will  alter  this  algorithm  slightly 
to  make  it  tolerant  to  random  classification  noise. 

2.3.4  A  new  guarantee  for  an  old  algorithm 

The  Modified  Perceptron  Algorithm  can  be  combined  with  the  Outlier  Removal 
Lemma  in  a  natural  way.  Given  a  data  set  S',  we  use  the  Lemma  to  produce  a 
set  S'  with  |S"|  >  IIS'!  and  such  that  for  all  vectors  tc,  max5/(?u  ■  <  /?E5/[(u;  •  x)^] 

where  jS  is  polynomial  in  n  and  b. 

We  then  reduce  dimensionality  if  necessary  to  get  rid  of  any  vectors  w  for  which 
the  above  quantity  is  zero.  That  is,  we  project  onto  the  subspace  L  spanned  by  the 
eigenvectors  of  the  XX^  matrix  having  non-zero  eigenvalue  (A"  is  the  matrix  of  points 
in  S').  Now,  we  perform  the  linear  transformation  A~^  described  in  Section  2.2  so 
that  in  the  transformed  space,  for  all  unit  vectors  w,  E5/[(u;-x)^]  =  1.  Our  guarantee 
for  set  S'  implies  that  in  the  transformed  space,  maxs/  |xp  <  (3n.  Thus,  for  any  unit 
vector  w, 

E5/[cos(m,x)^]  =  Es'^^ 

Fr 

^  E5/[(u;  •  x)^] 

~  max^/  \x\'^ 

>  l/ii3n). 

This  implies  that  in  the  transformed  space,  at  least  a  l/{2(3n)  fraction  of  the  points  in 
S'  satisfy  cos(u;,  x)^  >  l/(2/3n).  We  can  now  run  the  Modified  Perceptron  Algorithm 
with  <7  =  l/i/2/3?r,  and  guarantee  that  at  the  end,  at  least  a  l/(2/5?^)  fraction  of  the 
points  in  S'  satisfy  |  cos(i/;,x)|  >  cr. 

The  final  hypothesis  of  the  algorithm,  in  the  original  untransformed  space,  is:  if 
X  ^  L  OT  |cos(u>,  A“^x)|  <  a  then  guess  the  label  randomly  (or  say  “I  don’t  know”), 
and  otherwise  predict  according  to  the  hypothesis  w^A~^x  >  0. 
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Achieving  strong  (PAC)  learning 

The  algorithm  presented  above  splits  the  input  space  into  a  classification  region 

{x  :  X  E  L  and  |  cos(te,  >  ex} 

and  a  don’t-know  region 

{a;  :  a;  ^  L  or  I  cos(u;,  A~^x)\  <  c}. 

By  standard  VC-dimension  arguments  [59],  if  the  sample  S  is  drawn  from  distribu¬ 
tion  If,  then  for  any  e,  (^  >  0,  if  S  is  sufficiently  (polynomially)  large,  then  with  high 
probability  (>  1  —  the  true  error  of  the  hypothesis  inside  the  classification  region 
is  less  than  c.  Furthermore,  the  weight  under  D  of  the  classification  region  is  at  least 
lfpoly{n,b);  that  is,  the  fraction  of  S  that  lies  in  the  classification  region  is  repre¬ 
sentative  of  the  weight  of  this  region  under  D.  Therefore,  we  can  boost  the  accuracy 
of  the  learning  algorithm  by  simply  running  it  recursively  on  the  distribution  D  re¬ 
stricted  to  the  don’t-know  region.  The  final  hypothesis  produced  by  this  procedure  is 
a  decision  list  of  the  form:  “if  the  example  lies  in  the  classification  region  of  hypoth¬ 
esis  1,  then  predict  using  hypothesis  1,  else  if  the  example  lies  in  the  classification 
region  of  hypothesis  2,  then  predict  using  hypothesis  2,  and  so  on”. 

2.3.5  Learning  with  noise 

We  now  describe  how  the  Modified  Perceptron  Algorithm  can  be  converted  to  one 
that  is  robust  to  random  classification  noise.  We  do  this  by  recasting  the  algorithm  in 
the  Statistical  Query  (SQ)  model  of  Kearns  [42]  as  extended  by  Aslam  and  Decatur 
[4],  and  to  use  the  fact  that  any  SQ  algorithm  can  be  made  tolerant  of  random 
classification  noise. 

We  begin  with  some  observations.  For  convenience,  in  the  discussion  below  we 
will  normalize  the  examples  to  all  have  length  1,  so  that  we  need  not  distinguish 
between  x  and  x.  Recall  that  £{x)  =  1  if  x  is  a  positive  example  and  £(x)  —  —1  if  x 
is  a  negative  example. 

The  first  observation  is  that  the  only  properties  of  the  point  x  selected  in  Step  3  of 
the  Modified  Perceptron  Algorithm  that  are  actually  used  in  the  analysis  of  Theorem 
5  are: 


cos{w ,  x)£{x)  <  — c7,  and 

cosiw* .,  x)£{x)  >  0. 


(2.8) 

(2.9) 
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The  second  observation  is  that,  in  fact,  we  only  need  points  that  approximately  achieve 
these  two  properties.  In  particular,  suppose  that  every  point  x  we  use  in  Step  3 
satisfies  the  relaxed  conditions: 


cos{w,x)i{x)  <  —(jI2,  and  (2.10) 

2 

cos(u;*,a:)^(a:)  >  _  ^ - .  (2.11) 

lb\/n\nn 

The  first  condition  guarantees  that  after  t  =  (81n?r)/<T^  iterations  we  have  |zc|  < 
(1  —  {aj2yyU  <  l/n.  The  second  guarantees  that  if  initially  ww*  >  then  after 
t  iterations  w-w*  >  Therefore,  we  are  guaranteed  to  halt  before 

t  iterations  have  been  made. 

The  final  observation  is  that  any  positive  multiple  of 

=  E5[^(a:)a:.:  cos(w ,  x)i{x)  <  — cr] 

will  satisfy  conditions  (2.8)  and  (2.9),  assuming  zero  noise  so  that  every  x  E  S  satisfies 
(2.9),  and  if  we  define  £{iJ.w,s)  =  1-  Furthermore,  any  point  sufficiently  near  to 
will  satisfy  the  relaxed  conditions  (2.10)  and  (2.11).  Specifically,  the  definition  of 
the  fact  that  all  examples  have  length  1,  and  condition  (2.8)  together  imply 
that  \lJ,uj,s\  >  CT-  So,  any  point  such  that  \fiw,s  -  IJ'w,s  \  <  c7^/(16A/nlnn)  satisfies 
conditions  (2.10)  and  (2.11). 


Statistical  queries 

Let  f  he  &  function  from  labeled  examples  to  [0, 1].  That  is,  in  our  setting, 

/  {-1,1} 10,1], 

A  statistical  query  is  a  request  for  the  expected  value  of  /  over  examples  drawn 
from  distribution  D  and  labeled  according  to  the  target  concept  c;  i.e.,  a  request 
for  ExeD[fix,c{x))].  Assuming  that  /  is  polynomial-time  computable,  it  is  clear 
that  given  access  to  non-noisy  data,  this  expectation  can  be  estimated  to  any  de¬ 
sired  accuracy  r  with  any  desired  confidence  1  —  in  time  po/y(L,log(|)),  by  simply 
calculating  the  expectation  over  a  sufficiently  large  sample.  Kearns  [42]  and  Aslam 
and  Decatur  [4]  prove  that  one  can  similarly  perform  such  an  estimation  even  in  the 
presence  of  random  classification  noise. ^  Specifically,  for  any  noise  rate  p  <  1/2  and 

^Kearns  [42]  considers  queries  with  range  {0, 1}.  Aslam  and  Decatur  [4]  extends  these  arguments 
(among  other  things)  to  queries  with  range  [0, 1],  which  is  more  convenient  for  our  purposes. 
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any  accuracy  (or  tolerance)  parameter  r,  the  desired  expectation  can  be  estimated 
with  confidence  1  —  in  time  (and  sample  size)  po/j/(i,  log(|),  Thus,  to  prove 

an  algorithm  tolerant  to  random  classification  noise,  it  suffices  to  show  that  its  use  of 
labeled  examples  can  be  recast  as  requests  for  approximate  expectations  of  this  form. 

The  Modified  Perceptron  Algorithm  uses  labeled  examples  in  two  places.  The  first 
is  in  Step  2  where  we  ask  if  there  are  any  points  x  ^  S  such  that  cos{w,  x)£(x)  <  —cr, 
and  we  halt  if  there  are  none.  We  can  replace  this  with  a  statistical  query  request¬ 
ing  the  probability  that  a  random  labeled  example  from  D  satisfies  this  property 
(formally,  a  request  for  Ex£D[fix,c{x))]  where  f{x,£)  =  1  if  cos{w,x)£  <  —a  and 
f{x,i)  =  0  otherwise)  and  halting  if  this  probability  is  sufficiently  small.  Specifically, 
we  can  set  r  =  |e/(2^n)  and  halt  if  the  result  of  the  query  is  at  most  |e/(2/5n), 
where  l/(2/?n)  is  a  lower  bound  on  Pr^j^od  cos(u;,  a;)|  >  a)  from  the  Outlier  Removal 
Lemma. 

The  second  place  that  labeled  examples  are  used  is  in  Step  3.  As  noted  in  the 
discussion  following  equations  (2.10)  and  (2.11),  it  suffices  for  this  step  to  use  a 
good  approximation  to  instead  of  using  any  specific  labeled  example.  We  can 
find  such  an  approximation  via  statistical  queries.  Specifically,  to  approximate  the 
fth  coordinate  of  Hw.s,  we  ask  for  Ea,gD[f(x)a;j  |  cos(i(;,  x)£(x)  <  —cr].  This  condi¬ 
tional  expectation  can  be  approximated  as  the  ratio  of  the  answers  to  the  following 
two  statistical  queries.  One  is  a  request  for  Ea;e£)[/(a:,  c(x))]  where  f{x,l)  =  £x  if 
cos{w,x)  <  —a  and  f{x,i)  —  0  otherwise.  The  other  is  Pr[cos(re,  x)f(x)  <  — cr]  which 
we  saw  how  to  calculate  in  Step  2.  Note  that  we  are  guaranteed  from  Step  2  that 
Pr(cos(u;,  x)i{x)  <  —cr)  is  reasonably  large.  Finally,  we  combine  the  approximations 
for  each  coordinate  into  an  approximation  jlu>,s  of  /Wu.,s- 

Note  that  examples  are  also  used  in  the  algorithm  for  the  Outlier  Removal  Lemma. 
However,  since  this  algorithm  ignores  the  labels,  it  is  unaffected  by  classification  noise. 


2.4  Experiments  and  other  potential  applications 

Even  though  the  theoretical  bounds  established  in  the  previous  sections  are  not  small 
enough  to  directly  imply  the  practicality  of  our  techniques,  the  algorithms  seem  to 
do  quite  well  in  practice. 

We  conducted  the  two  different  experiments  to  gather  empirical  evidence.  Both 
were  based  on  the  perceptron  algorithm. 


•  Efficiency. 
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DATA.  Synthetically  generated,  linearlj^-separable  set  of  points,  with  a  small 
number  of  outliers  (10%  -  20%).  We  generated  a  set  of  points  at  random,  then 
inserted  some  outliers  bj^  hand. 

We  ran  the  perceptron  algorithm  on  data  sets  of  various  sizes  (using  a  random 
misclassified  example  at  each  iteration).  Then  we  removed  the  outliers,  and  ran 
it  again  on  the  outlier-free  data,  then  incorporated  the  outliers  (usually  this  took 
only  a  few  extra  iterations).  We  considered  a  couple  of  different  variations  of 
the  algorithm,  e.g.  with  the  misclassified  examples  normalized  to  unit  length  at 
each  iteration,  and  with  the  outlier-free  data  transformed  so  that  the  expected 
squared  distance  was  1  in  every  direction.  In  almost  every  case  the  number  of 
iterations  to  convergence  dropped  to  50%  of  the  original  number  (and  lesser  in 
some  cases). 

•  Quality. 

DATA.  Breast  Cancer  measurements  [62]  on  569  patients  with  thirty  attributes 
of  measurements  on  the  cell  nucleus  (such  as  radius,  texture,  and  smoothness) 
and  one 'field  indicating  the  diagnosed  nature  of  the  cancer,  “Benign”  or  “Ma¬ 
lignant”  . 

On  running  the  perceptron  algorithm  on  the  data  set  for  different  numbers  of 
iterations,  the  best  half-space  we  could  find  correctly  classified  79%  of  the  data 
set.  Next  we  set  the  the  outlier  parameter,  /?  =  5,  and  removed  the  outliers. 
There  were  364  points  left  (the  computation  took  a  few  seconds  in  the  MATLAB 
environment  on  an  IBM  PowerPC).  Then  we  ran  the  algorithm  on  the  outlier- 
free  data  and  tested  the  half-space  obtained  on  the  entire  data  set.  It  classified 
91%  of  the  data  correctly 

Remarks:  It  is  possible  that  there  exists  an  even  better  half-space  for  this  data 
set.  On  just  the  outlier-free  data,  the  half-space  performed  even  better. 


While  these  experiments  are  not  in  any  way  conclusive,  they  indicate  that  outlier 
removal  might  have  many  more  ramifications  than  our  theorems  imply.  One  thing  is 
clear  —  they  call  for  more  extensive  experiments. 

^It  is  a  mystery  to  me  that  there  should  exist  such  a  good  half-space  for  separating  the  benign 
and  malignant  cases  of  breast  cancer 
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2.5  Conclusion  and  open  problems 

In  this  chapter  we  have  seen  evidence  that  outliers  could  play  a  critical  role  in  the 
performance  and  speed  of  an  algorithm.  We  presented  a  method  to  find  outliers 
defined  in  a  strong  way  and  proved  theoretical  guarantees  about  the  method. 

Unfortunately  these  guarantees  are  rather  weak  at  the  moment.  Can  the  ratio 
between  the  maximum  and  the  average,  /?,  in  the  outlier  removal  lemma  be  reduced 
to  a  smaller  polynomial?  Our  experiments  suggest  that  the  true  bound  might  be 
significantly  better. 

Can  we  provide  theoretical  guarantees  for  outlier  removal  in  other  contexts  such 
as  nearest  neighbor  search?  Very  recently,  Edith  Cohen  showed  that  the  Outlier 
Removal  Lemma  can  be  used  to  learn  a  noisy  half-space  as  a  single  half-space  [17]. 
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Chapter  3 

Reducing  Dimensionality  by 
Random  Projection  I 


We  use  random  projection  to  1-dimensional  subspaces  to  identify  the  relevant  subspace 
in  an  algorithm  for  learning  the  intersection  of  half-spaces. 


3.1  Introduction:  the  intersection  of  half-spaces 

In  this  chapter  we  consider  the  problem  of  learning  the  intersection  of  k  half-spaces  in 
n  dimensions  from  labelled  examples.  We  are  presented  with  points  in  n-dimensional 
space  each  labelled  positive  or  negative.  The  problem  is  to  find  a  set  of  k  half-spaces 
such  that  all  the  positive  examples  lie  in  a  single  region  of  intersection  of  the  k  half- 
spaces  and  all  the  negative  examples  lie  outside  this  region,  if  such  a  set  of  half-spaces 
exists.  For  k  =  1,  this  corresponds  to  learning  a  single  half-space  (also  called  a  percep- 
tron).,  which  we  considered  in  the  previous  chapter.  As  we  observed,  learning  a  single 
perceptron  is  equivalent  to  linear  programming  and  hence  can  be  solved  in  polynomial 
time.  Other  solutions  to  this  problem,  notably  the  perceptron  algorithm,  have  also 
been  studied  in  the  literature  [12]  (as  we  saw  in  the  last  chapter).  While  perceptrons 
themselves  are  highly  interesting  and  in  fact  have  directly  found  applications,  it  is 
often  the  case  that  one  needs  a  more  complex  concept  class  to  accurately  model  some 
phenomenon.  The  intersection  of  k  half-spaces  is  a  natural  generalization  of  a  percep¬ 
tron.  Besides  its  intuitive  geometric  appeal,  in  principle  any  convex  concept  could  be 
approximated  by  an  intersection  of  half-spaces.  It  also  approximates  a  simple  neural 
network  which  is  used  in  many  machine  learning  applications. 

What  is  the  complexity  of  learning  the  intersection  of  k  half-spaces?  There  are 
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several  negative  results  about  this  [9, 14,  46,  49].  In  the  distribution-free  model,  where 
we  make  no  assumptions  on  the  distribution  from  which  examples  are  presented  to  us, 
and  under  the  requirement  that  the  algorithm  must  produce  a  set  of  k  half-spaces,  the 
problem  cannot  be  solved  in  polynomial  time  even  for  =  2  unless  RP  =  NP  [14,  49]. 

On  the  other  hand,  for  special  distributions  there  are  some  positive  results.  Baum  [8] 
gave  an  algorithm  for  learning  the  intersection  of  two  homogenous  half-spaces  (a  half¬ 
space  is  homogenous  if  the  hyperplane  defining  it  passes  through  the  origin)  over  any 
distribution  T>  that  is  origin-symmetric,  i.e.,  for  any  x  e  D{x)  -  V{-x)  (any 
point  X  and  its  reflection  through  the  origin  are  equally  likely).  Recently,  Blum  and 
Kannan  [13]  gave  a  polynomial  time  algorithm  for  the  problem  for  any  constant  num¬ 
ber  of  half-spaces  for  the  uniform  distribution  on  the  unit  ball  in  n  dimensions.  Their 
algorithm  does  not  explicitly  find  the  half-spaces,  instead  it  finds  a  prediction  rule 
which  can  be  evaluated  in  polynomial  time  (for  a  constant  number  of  half-spaces) 
and  is  probably  approximately  correct.  The  running  time  and  number  of  examples 
required  by  the  algorithm  are  doubly  exponential  in  k. 

We  present  a  randomized  algorithm  for  the  problem.  Besides  being  simpler,  our 
algorithm  improves  on  the  previous  one  in  three  ways: 

•  It  is  faster:  the  running  time  and  number  of  examples  required  are  (singly) 
exponential  in  k  and  polynomial  in  n.  Hence  we  can  learn  the  intersection  of 
up  to  (9(log  n/log  log  n)  half-spaces  in  polynomial  time. 

•  The  concept  that  it  reports  is  shorter:  our  algorithm  explicitly  finds  an  (inter¬ 
section  of  a)  set  of  0{k)  half-spaces. 

•  It  can  handle  more  general  distributions  (for  a  constant  number  of  half-spaces): 
specifically  any  distribution  on  the  unit  ball  that  is  not  “concentrated”,  i.e., 
whose  probability  density  is  at  least  Ij poly{n)  and  at  most  poly{n)  everywhere. 

Although  our  algorithm  is  quite  different  from  that  of  Blum  and  Kannan  it  is 
inspired  by  the  the  same  observation:  the  true  complexity  of  the  problem  is  deter¬ 
mined  not  by  the  dimension  n  or  the  number  of  half-spaces  k,  but  by  the  dimension 
of  the  subspace  spanned  by  their  normal  vectors  to  the  defining  hyperplanes.  Indeed 
our  algorithm  can  learn  the  intersection  of  any  number  of  half-spaces  so  long  as  their 
normals  span  a  subspace  of  dimension  O(log  n/ log  logjr). 

To  explain  the  idea,  let  us  assume  that  the  half-spaces  are  homogenous,  i.e.,  the 
hyperplanes  defining  them  pass  through  the  origin  (this  is  actually  without  loss  of 
generality  for  us  as  shown  in  section  3.3.5).  Let  the  half-spaces  he  wi  ■  x  <  0,W2  ■  x  < 
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0, . . .  ,Wk  •  X  <  0.  The  intersection  of  these  half-spaces  is  the  positive  region  P. 
For  each  half-space  Wi  ■  x  <  0,  Wi  is  the  normal  vector  to  the  hyperplane  defining 
the  half-space  (lying  in  the  region  opposite  the  positive  region  with  respect  to  this 
hyperplane).  Our  goal  will  be  to  find  a  set  of  normal  vectors  that  are  very  close  in 
angle  to  Wi, . . .  ,Wk.  For  this  we  consider  the  set  of  all  normal  vectors  that  define 
hyperplanes  through  the  origin  which  do  not  intersect  the  positive  region  P.  This  is 
precisely  the  cone  at  the  origin  formed  by  the  vectors  rci, . . .  Formally,  it  is  the 
set  Dp  of  all  vectors  v  such  that  v  =  aj  >  0.  Then  each  vector  v  G  Dp 

has  the  property  that  for  any  positive  example  x,  v  •  x  <  0.  In  other  words  Dp  is  the 
set  of  normal  vectors  of  hyperplanes  that  do  not  intersect  P.  In  linear  programming 
theory  P  and  Dp  are  dual  to  each  other. 

The  first  step  of  the  algorithm  is  to  find  a  good  approximation  to  Dp.  Although 
the  minimum  dimension  of  a  subspace  containing  Dp  is  at  most  k  (it  could  be  less) 
we  first  find  a  good  “approximation”  to  Dp  in  n  dimensions.  This  is  done  by  simply 
choosing  a  large  sample  of  examples  and  considering  the  dual  Dc  of  their  conical 
hull  C,  i.e.,  the  set  of  homogenous  hyperplanes  that  do  not  intersect  the  convex  hull 
of  the  examples.  In  the  second  step  we  apply  a  simple  procedure  based  on  random 
sampling  to  identify  a  A:-dimensional  subspace  close  to  the  subspace  containing  Dp. 
Then  we  project  our  n-dimensional  approximation  of  Dp  to  this  relevant  subspace. 
Let  this  projection  be  Dq.  The  next  step  is  choose  vectors  from  Dc  to  guarantee 
that  for  each  Wi  there  is  at  least  one  vector  in  the  sample  close  to  it  (in  angle).  We 
could  do  this  by  simply  considering  all  points  of  a  sufficiently  fine  grid  enclosing  Dc- 
The  size  of  the  sample  is  chosen  to  be  large  enough  to  guarantee  that  for  each  Wi 
there  is  at  least  one  vector  in  the  sample  close  to  it  (in  angle).  Finally  we  prune  the 
sample  using  a  greedy  heuristic.  The  half-spaces  defined  by  the  vectors  in  the  pruned 
sample  constitute  the  concept  output  by  the  algorithm.  In  other  words,  we  label  a 
point  positive  if  it  lies  in  the  intersection  of  these  half-spaces  and  negative  otherwise. 

For  most  of  the  discussion  we  assume  that  the  half-spaces  we  are  trying  to  learn 
are  homogenous.  In  the  next  section  we  introduce  the  framework  and  notation.  Then 
we  describe  our  algorithm  in  detail.  Section  3.3.1  is  devoted  to  proving  a  property 
we  need  of  a  large  sample  of  examples.  Section  3.3.2  outlines  a  proof  of  the  sampling 
procedure  used  to  find  the  relevant  subspace.  Section  3.3.3  describes  the  grid  we  use 
for  sampling  and  a  bound  on  the  size  of  the  sample  we  need.  Section  3.3.4  discusses  the 
final  pruning  step.  Section  3.3.5  shows  how  to  reduce  the  non-homogenous  case  to  the 
homogenous  one  (this  does  not  work  in  general,  only  for  the  restricted  distributions 
we  can  handle).  In  a  brief  concluding  section  we  mention  possible  extensions  of  this 
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work  and  some  related  questions. 


3.1.1  Preliminaries 

We  adopt  the  terminology  used  in  the  literature.  To  recap  cpickly,  an  example  is  a 
point  in  P”;  a  concept  is  a  subset  of  ii".  An  example  that  belongs  to  a  concept  is 
a  positive  example  for  the  concept,  and  an  example  that  lies  outside  the  concept  is  a 
negative  example.  Given  a  set  of  labelled  examples  drawn  from  an  unknown  distri¬ 
bution  D  in  i?”,  and  labelled  according  to  an  unknown  target  concept  the  learning 
task  is  to  learn  the  target  concept.  What  this  means  is  that  given  an  error  parameter 
e  and  a  confidence  parameter  with  probability  at  least  1  —  (^  the  algorithm  has  to 
find  a  concept  that  has  error  at  most  e  on  D. 

For  us  the  target  concept  is  the  intersection  of  /  half-spaces  in  BT ,  such  that 
the  normal  vectors  to  the  hyperplanes  bounding  these  half-spaces  span  a  subspace  of 
dimension  k.  For  most  of  the  chapter  we  will  assume  that  the  hyperplanes  bounding 
the  half-spaces  pass  through  the  origin.  Each  point  z  G  i?”  is  labelled  positive  or 
negative  according  to  the  following  rule: 


£{x)  =  -)-  if  Wx  <  0 
l{x)  =  —  otherwise. 

Here  W  is  a  real  matrix  of  rank  k  with  I  rows  and  n  columns.  Each  row  Wi 
represents  a  half-space  Wi  ■  a;  <  0.  Hence  a  point  x  is  labelled  positive  if  it  lies  in  the 
intersection  of  the  k  half-spaces  and  it  is  labelled  negative  otherwise.  Formally,  the 
positive  region  P  is: 

{x  eR"\xe  Bn,Wx  <  0} 

Examples  presented  to  the  algorithm  are  drawn  from  some  unknown  distribu¬ 
tion  on  the  unit  ball  in  n  dimensions,  We  assume  that  the  distribution  is  non¬ 
concentrated,  meaning  that  its  probability  density  is  at  least  llpoly{n)  and  at  most 
poly{n)  everywhere  in  the  unit  balF.  We  can  also  assume  that  P  occupies  at  least  an 
e  fraction  of  the  unit  ball  (otherwise  we  could  simply  output  the  concept  that  labels 
every  point  negative).  It  is  worth  noting  that  such  a  distribution  is  a  “good”  one  for 

^It  is  worth  noting  that  such  a  distribution  is  a  “good”  one  for  the  perception  algorithm  discussed 
in  the  previous  chapter.  We  could  set  the  a  of  the  algorithm  to  be  l/poly(n)  and  in  polynomial  time 
the  algorithm  would  find  a  half-space  that  classified  all  but  a  polynomial  fraction  correctly. 
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the  perception  algorithm  discussed  in  the  previous  chapter.  We  could  set  the  cr  of 
the  algorithm  to  be  l/po/y(ra)  and  in  polynomial  time  the  algorithm  would  find  a 
Let  Dp  denote  the  dual  to  the  cone  formed  by  the  positive  region. 

Dp  =  {v  e  :  V  =  >  0} 

i 

We  could  project  down  to  the  subspace  spanned  by  Dp  and  the  projections  tuj,  . . .  ,w 
of  the  rui’s  give  half-spaces  that  separate  the  positives  from  the  negatives  in  the  re¬ 
duced  space. 

We  say  that  a  convex  cone  K  in  is  t-enclosed  by  another  convex  cone  K'  if 
K  C  K',  and  for  every  point  x  ^  K'  f\  Bn  there  is  some  point  y  €  A'  fl  Bn  such  that 
the  angle  between  the  vectors  x  and  y  is  at  most  e.  In  other  words  for  each  x  G  K' 
there  is  a  y  in  K  such  that  the  angle  between  x  and  y  at  the  origin  is  at  most  e. 
Inversely  we  say  that  K'  e-encloses  K. 

The  projection  length  of  a  convex  body  K  along  a  unit  vector  (or  direction)  v  is 
the  length  of  the  1-dimensional  projection  of  K  onto  v.  Intuitively,  it  is  the  width  of 
the  body  in  the  direction  v.  Formally  it  is 

max  X  ■  V  —  mill  x  •  v 

x£K  x^K 

3.2  The  Algorithm 

The  input  to  the  algorithm  is  an  error  parameter  e,  a  confidence  parameter  5,  and  a 
set  of  labelled  examples.  The  output  of  the  algorithm  is  a  set  of  m  half-spaces. 

Here  we  give  a  high-level  description  of  the  algorithm.  Details  and  proofs  of 
individual  steps  are  in  later  sections.  The  parameters  e,  will  be  specified  later. 

1.  Approximate  the  dual  cone.  Let  S  be  the  set  of  positive  examples  presented 
to  the  algorithm.  Let  C  be  the  cone  formed  by  the  vectors  in  S  and  Dc  be  the 
dual  of  (7,  i.e.,  the  set  of  normal  vectors  to  hyperplanes  that  do  not  intersect 
C .  The  size  of  S  is  chosen  so  that  Dc  ei-encloses  Dp. 

2.  Identify  the  “irrelevant”  subspace.  We  do  this  by  finding  a  set  of  n  —  fc 
orthogonal  vectors  {xi,X2,  ■ . .  ,Xn-k}  ‘‘■s  follows:  Choose  a  set  of  random  unit 
vectors  and  let  rii  be  the  one  among  them  such  that  the  projection  length  of 
DcH  in  the  direction  of  xi  is  minimum.  Now  pick  unit  unit  vectors  orthogonal 
to  Xi  and  let  X2  be  the  one  among  them  with  the  minimum  projection  length 
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of  Dc  C  Bn.  In  this  way  at  step  i  we  find  a  vector  .Xi  that  is  orthogonal  to 
xi, . . . ,  x,:_i  and  the  projection  length  of  Dc  along  x,-  is  small.  After  n  —  k  steps, 
the  vectors  {xi, . . . ,  span  a  subspace  that  approximates  the  irrelevant 

subspace.  The  size  of  the  sample  is  chosen  so  that  at  each  vector  Xi  is  a.  small 
angle  (£2)  away  from  being  orthogonal  to  Dp. 

3.  Reduce  dimensionality.  Project  Dc  (implicitly)  to  the  subspace  orthogonal 
to  that  spanned  by  {xi,X2, . . .  ,x„_fc}.  Let  Dc  be  the  projection. 

4.  Sample  the  dual.  Consider  all  lattice  points  spaced  at  £3  units  in  a  box  that 
encloses  Dc-  In  other  words,  in  the  original  space  for  each  tc,-  there  is  a  vector 
corresponding  to  some  lattice  point  that  makes  an  angle  less  than  £3  with  Wi. 

5.  Prune  the  sample.  Let  U  be  the  random  sample  from  the  previous  step.  Let 

be  a  new  set  of  D,{nl)  examples.  Greedily  choose  a  subset  of  U  of  size  m  as 
follows:  let  Ui  be  the  vector  from  U  such  that  Ui  •  x  <  0  separates  the  largest 
number  of  negative  examples  from  the  positive  examples  in  S-y.  Discard  the 
negative  examples  that  are  separated  in  this  way  by  ui.  Let  U2  be  the  vector 
from  V  that  separates  the  largest  number  of  remaining  negative  examples  in 
Si  from  the  positives.  Similarly,  let  Ui  be  the  vector  from  U  that  separates  the 
largest  number  of  remaining  negatives.  The  number  of  steps  m  is  chosen  so  that 
the  number  of  remaining  negatives  after  m  steps  is  less  than  an  c/2  fraction  of 
the  initial  number.  Output  the  half-spaces  ui  •  x  <  0,  U2  •  x  <  0, . . . ,  •  x  <  0. 

Given  an  unlabelled  point  x,  we  project  it  to  Dc  and  then  label  it  positive  if  it 
lies  in  the  intersection  of  the  above  half-spaces  and  negative  otherwise. 

In  the  last  step,  alternatively,  we  could  extend  the  vectors  uj,. . .  to  vectors 
in  R”  and  output  the  corresponding  half-spaces  in  R". 

In  order  to  learn  a  constant  number  of  half-spaces  in  polynomial  time,  step  (4) 
of  the  algorithm  could  be  replaced  by  any  standard  method  to  learn  half-spaces  in 
constant-dimensional  space.  However,  by  this  approach  we  can  allow  the  relevant 
subspace  to  have  dimension  greater  than  a  constant. 


3.3  The  Analysis 

Our  main  theorem  is  a  performance  guarantee  for  this  algorithm  for  a  suitable  choice 
of  the  £j’s.  In  the  statement  below  p  >  1  is  a  parameter  of  the  distribution.  If  D 
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has  parameter  p  then  the  probability  density  everywhere  in  the  unit  ball  is  between 
-  and  p. 

Theorem  6  The  above  algorithm  PAC-learns  with  parameter  e  and  S  the  intersection 
of  I  half-spaces  whose  normals  lie  in  a  k-dimensional  subspace,  and  has  a  bound  of 

0(poi!,(n)iV(5)*logh 

e  0 

on  the  running  time  and  the  number  of  examples  required. 

For  nearly-uniform  distributions,  i,e.,  when  p  is  a  constant,  this  gives  us  a  polynomial¬ 
time  algorithm  for  k  up  to  about  log  n/ log  log  n. 

3.3.1  A  large  sample  of  examples 

In  this  section  we  derive  an  upper  bound  on  the  number  of  examples  required  by 
the  algorithm.  Let  S  be  the  set  of  examples.  Then  l^l  should  be  large  enough  to 
guarantee  that  the  dual  to  the  cone  formed  by  S  ci-encloses  the  dual  to  the  cone 
formed  by  P. 

Lemma  4  Let  D  be  a  non-concentrated  distribution  on  Let  S  be  a  random  sample 
of  positive  examples  from  this  distribution  and  C  be  the  cone  at  the  origin  formed  by 
the  points  in  S.  Let  D  be  the  dual  cone  of  C .  If 

|5|  =  n{p(n).(-)Mogi) 

Cl  0 

t 

where  p{n)  is  a  fixed  polynomial  that  depends  only  on  T>,  then  with  probability  at  least 
1  —  5,  the  cone  Dp  is  ti-enclosed  by  Dq. 

Proof.  Let  Dc,  the  dual  of  a  cone  C',  be  the  maximum  body  that  ci-encloses  Dp. 
Our  goal  is  to  show  that  the  dual  Dc  of  the  conical  hull  C  of  a  large  enough  sample 
S  will  be  contained  in  Dq  with  high  probability.  To  prove  this  we  show  that  for  each 
point  h  on  the  boundary  of  Dc,  there  is  a  supporting  plane  of  Dc  which  separates  h 
from  Dp. 

Let  h  •  x'  =  0  be  a  supporting  hyperplane  of  C  such  that  C  lies  in  the  half-space 
h  ■  x'  <  0  and  consider  the  convex  region  P  H  BnP  h  •  x'  >  0.  The  key  idea  is  to  show 
that  the  probability  that  there  is  a  point  in  the  sample  from  such  a  region  is  high.  This 
probability  is  the  total  probability  mass  assigned  to  this  part  of  the  unit  ball,  i.e.,  it  is 
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at  least  times  the  fraction  of  the  volume  of  the  unit  ball  occupied  by  this  region. 
To  calculate  the  fraction  of  the  unit  ball  occupied  by  this  region,  we  can  first  go  down 
to  the  {k  +  l)-dimensional  space  spanned  by  wi,W2,...,wi  and  h.  In  this  space  the 
volume  of  the  region  grows  with  Ci  roughly  as  at  least  ■~eiVol{Bk)  where  V ol{Bk) 
is  the  volume  of  the  unit  ball  (this  essentially  follows  from  the  observation  that  the 
positive  region  P  occupies  at  least  l/p(n)  fraction  of  Bk).  So  the  probability  that  any 
single  example  falls  in  the  region  is  Ul[e\ / .  Now  we  use  the  VC-dimension  of  the 
intersection  of  up  to  /  +  1  half-spaces  to  complete  the  proof.  The  VC  theorem  implies 
that  if  we  consider  a  sample  of  size  then  with  high  probability  every  concept 

in  the  class,  i.e.,  an  intersection  of  every  set  of  /  +  1  half-spaces  with  more  than  e 
probability  will  see  at  least  one  example  in  the  sample.  We  set  e  in  the  theorem  to 
be  t\lpin)  to  complete  the  proof.  □ 


3.3.2  Identifying  the  relevant  subspace 

In  this  section  we  analyze  the  procedure  to  approximately  identify  the  irrelevant 
subspace  and  hence  the  subspace  spanned  by  Dp  . 

The  first  step  is  to  find  a  vector  Xi  such  that  the  projection  length  of  Dc  onto 
Xi  is  small.  For  this  we  do  a  the  following.  Pick  a  set  of  random  unit  vectors,  and 
find  the  projection  length  of  Dc  H  Bn  along  these  vectors.  Note  that  computing  the 
projection  length  of  Dc  H  onto  a  vector  x  can  be  done  efficiently:  we  need  to 
calculate  maXyer)cnB„  x  ■  y  and  minygo^,nB„  x  ■  y,  both  of  which  are  convex  programs 
with  simple  separation  oracles  [34] 

Let  xi  be  the  direction  along  which  the  projection  length  is  minimum.  We  estimate 
the  probability  that  xi  is  nearly  orthogonal  to  Dp  using  the  following  fact. 


Fact  1  The  volume  of  the  n- dimensional  ball  of  radius  r  is  equal  to  and  its 

^  nl {n/2) 

j.  .  2r”-lTr"/2 

surface  area  ts  Y  . 

■>  r(n/2) 


Lemma  5  Let  Vi^V2-, . .  .Vk  be  orthogonal  unit  vectors  in  i?"  and  let  a  <  Then 
the  probability  that  a  random  unit  vector  x  satisfies  \vi  •  x\  <  cx,  for  all  i,  is 

fl((l  - 


In  other  words  the  lemma  lower  bounds  the  probability  that  a  random  unit  vector 
is  nearly  orthogonal  to  each  of  a  fixed  set  of  k  vectors. 

Proof.  Consider  the  set  of  of  unit  vectors  x  such  that  <  a  for  allt  =  1, . . . ,  k. 

This  is  the  intersection  of  the  unit  ball  Bn  with  2k  half-spaces.  The  intersection 
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contains  a  ^^band”  which  has  as  its  base  an  n  — A:  dimensional  ball  of  radius  y{l~-ka^) 
and  a  thickness  of  2a  in  the  other  k  orthogonal  dimensions.  A  simple  calculation  based 
on  fact  1  gives  the  bound.  □ 

Now  we  sample  only  from  vectors  orthogonal  to  Xi^  record  the  vector  X2  with  the 
minimum  projection  length  and  repeat  this  to  find  0:2, .. .  Let  Dc  denote  the 

projection  of  Dc  to  the  subspace  orthogonal  to  Xi,.T2,  . . .  ^Xn-k- 

From  the  lemma  it  follows,  for  example,  that  if  we  pick  n/ (3^  random  unit  vectors 
then  with  high  probability  one  of  them  will  have  a  dot  product  of  magnitude  less  than 
with  each  of  a  fixed  set  of  k  orthogonal  unit  vectors.  By  letting  the  k  vectors 
to  be  a  set  of  basis  vectors  of  Dp  we  have  that  w.h.p.  xi  will  be  nearly  orthogonal 
to  Dp. 

Lemma  6  Assume  the  projection  length  of  Dc  fl  Bn  along  any  direction  orthogonal 
to  Dp  is  .at. most  a/2  and  along  any  direction  in  the  subspace  spanned  by  Dp  is  at 
least  a.  Then  a  sample  of  n/a^  random  unit  vectors  to  find  each  Xi  guarantees  that 
the  vectors  Xi.,..^Xn~k  a,re  almost  orthogonal  to  Dp,  i.e.,  \vi  *  Xj\  <  ,  —  —  for 

y/n+l-j-k 

i  =  1, . . . ,  A:  and  j  =  1, . . .  ,n  ~~  k.  Further  any  unit  vector  w  G  Dp  has  a  projection 
w  G  Dc  such  that  w  •  w  >  I  —  a^k  log  n. 


Proof  From  lemma  5  the  vector  Xi  satisfies  Vi  •  xi  <  ^7=^.  For  the  purpose  of 
analysis  we  can  view  the  second  step  of  the  algorithm  as  first  projecting  Dc  to  the 
subspace  orthogonal  to  xi  and  then  sampling  from  all  unit  vectors  in  that  subspace. 
The  projection  of  Vj  is  Vi  —  {vi  •  X\)x\.  So  again  from  the  lemma  we  have 

a 

\vi  ■  X2I  =  |[ui  -  {vi  ■  a;i).Ti]  •  X2I  <  . 

Vn  —  1  —  k 

At  the  step,  the  projection  of  Vi  is  Vi  —  J2]z\{v{  •  Xj)xj  and  it  follows  that  \vi  -  Xj\  < 

_ a _ 

y/n-\-l~t~k  ' 

To  prove  the  second  part,  for  a  unit  vector  w  G  Dp,  w  =  w  —  So 

w  •  w  =  1  —  ■  ^i)^-  We  rewrite  w  as  Yl^=\  where  =  1  ^-nd  then 

k 

=  I  X]  '  ^i)l 

i=l 


< 


a 


y/n  +  1  -  j  ~  k  "7 


£ 


a\ 


n  1  —  j  ~  k 


We  plug  this  into  the  expression  for  tc  •  u;  to  get  that  w  •  w  >  1  —  a‘^k\og  n.  □ 
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The  first  assumption  of  the  lemma  can  be  satisfied  by  setting  Ci  =  62/2  in  the 
previous  step  of  the  algorithm.  The  second  assumption  is  not  really  restrictive.  If  it  is 
not  true,  then  that  means  that  we  can  reduce  the  problem  to  one  in  A:  —  1-dimensional 
space.  By  setting  a  =  ^  we  get  that  each  w  G  Dp  has  a  projection  w  G  Dc  at 

an  angle  of  no  more  than  €2  for  small  values  of  (.2. 

3.3.3  Sampling  the  dual 

The  next  step  is  to  find  a  sample  the  projected  dual  so  that  there  is  a  point  in  the 
sample  within  €3  of  each  wj .  We  do  this  by  simply  finding  a  sample  that  has  point 
close  to  every  point  of  Dc  fl  Bk-  Let  be  the  integer  lattice  in  k  dimensions,  and 
let  be  a  scaled  down  integer  lattice  where  the  least  spacing  between  points  is  63. 
Then  the  sample  we  consider  is  {Dc  H  Bk  fl  e^Z^.  It  is  easy  to  see  that  the  size  of  the 
sample  is  bounded  by  (^)^- 

3.3.4  Pruning  the  sample  of  normal  vectors 

Here  we  show  that  m  =  0{l).  Let  be  a  fresh  sample  of  examples.  From  standard 
y C-dimension  arguments  it  is  enough  to  find  a  set  of  half-spaces  that  have  an  error 
less  than  e  on  a  sample  whose  size  is  the  yC-dimension.  So  we  choose  IS*!]  to  be 
D{nl). 

Let  Vi,V2, . . .  ,vi  be  the  vectors  in  U  that  are  closest  in  angle  to  roj,  u;^, . . . ,  ro/ 
respectively.  Let  be  the  region  uj  •  a;  <  0, . . . ,  u/  •  a:  <  0.  Fy  is  the  positive  region 
according  to  these  vectors.  Assume  that  the  combined  error  of  these  half-spaces  with 
respect  to  the  actual  set  of  half-spaces  is  bounded  by  e/2.  In  other  words  there  is  a 
set  of  I  half-spaces  in  U  that  correctly  classify  a  (1  —  |)  fraction  of  the  distribution. 
This  can  be  achieved  by  setting  e2  -f  €3  =  ^  in  the  previous  steps  of  the  algorithm. 

Our  procedure  to  prune  U  is  the  following:  pick  the  best  u  from  U,  i.e.,  the  vector 
u  such  that  the  half-space  u  •  a:  <  0  separates  the  maximum  number  of  negatives 
from  the  positives  in  Si.  Call  it  uj.  Then  pick  the  best  u  for  the  remaining  negative 
examples  and  so  on.  From  fairly  standard  set  cover  guarantees  it  follows  that  a  greedy 
set  of  I  half-spaces  must  separate  at  least  half  as  many  as  the  best  set  of  I  half-spaces, 
and  a  set  of  I  log  r  greedy  half-spaces  separate  at  least  1  —  4  fraction  of  what  the  best 
set  of  I  half-spaces  can  separate.  We  formalize  this  in  the  following  theorem. 

Theorem  7  A  greedily  chosen  set  of  21  log  r  half-spaces  will  with  high  probability 
correctly  classify  at  least  (1  -  4)(1  -  |)  fraction  of  the  distribution  and  hence  achieve 
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PAC-learning. 

Setting  r  to  be  less  than  |  (say)  gives  us  a  set  of  0{l)  planes  that  correctly  classify 
1  —  e  of  the  distribution. 

3.3.5  The  non- homogenous  case 

The  discussion  so  far  has  assumed  that  the  half-spaces  we  are  trying  to  learn  are 
homogenous.  Of  course  this  may  not  be  the  case  in  general,  and  here  we  show  a 
simple  reduction  from  the  general  (non-homogenous)  case  to  the  homogenous  case  by 
going  to  a  representation  in  one  more  dimension,  i.e.,  in  The  key  issue  will 

be  to  make  sure  that  we  can  do  this  while  keeping  the  distribution  in  the  new  space 
non-concentrated.  We  make  two  observations  for  this. 

Our  first  observation  is  that  all  the  algorithm  needs  is  a  distribution  on  the  unit 
sphere  (rather  than  the  ball)  in  ii”  such  that: 

Any  convex  region  of  the  sphere  with  more  than  l/jt>(n)  fraction  of  the  area  of  the 
sphere  should  have  probability  mass  between  lfq[n)  and  q{n)  for  some  polynomials 
p(n),  q{n).  (Analogously,  any  ci  fraction  should  have  probability  mass  between  C2  and 
l/c2  for  constants  Ci,C2.) 

Our  second  observation  is  that  we  only  need  the  above  condition  to  hold  for  the 
part  of  the  unit  sphere  that  is  in  the  positive  region.  (Negative  examples  are  used 
only  in  the  last  step.) 

The  following  mapping  from  the  unit  ball  Bn  in  il”  to  the  unit  sphere  Sn  in 
satisfies  these  conditions. 


Vi  = 

— ^ -  for  1  -  1 

1)2 

yn+1  ~ 

1 

4-  1 

This  corresponds  to  blowing  up  Bn  by  a  factor  of  n  then  placing  it  on  the  plane 
Vn+i  =  1  nnd  then  mapping  it  stereographically  to  the  unit  sphere  (scaling  the  length 
of  the  image). 

The  new  coordinate  yn+i  lies  between  1  and  i-e.,  points  are  mapped  only 

to  the  part  of  the  sphere  in  the  half-space  yn+i  >  Suppose  the  positive  region 

in  i?"  is  given  by  the  set  of  (possibly  non-homogenous)  half-spaces  Wi  •  x  <  bi  for 

i  =  1,...,/. 
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Then  in  the  positive  region  is  given  by  the  half-spaces 


{wi,  -b,n)  •  y  <  0 

Vn+l  ^  /  2  I  1 

Yn^  +  1 

This  is  then  approximated  by  the  homogenous  half-spaces  (iCj-,  —bin)  •  y  <  0  and 
Vn+i  >  0  and  we  can  run  the  algorithm  to  learn  them.  The  “blow-up”  factor  n  could 
be  replaced  by  any  poly{n).  Note  that  it  does  not  matter  that  only  (about)  half  the 
sphere  is  used. 


3.4  Conclusion  and  open  problems 

We  have  seen  the  simplicity  and  utility  of  random  projection  as  applied  to  a  classical 
problem  in  learning  theory.  In  the  next  chapter  we  study  random  projection  in 
a  rather  different  context,  namely  information  retrieval.  Random  projection  also 
seems  to  be  a  natural  scheme  for  rounding  semidefinite  relaxations  to  vertex-ordering 
problems.  Recently,  Kleinberg  [45]  gave  an  algorithm  for  finding  approximate  nearest 
neighbors  using  similar  techniques.  It  is  my  feeling  that  other  applications  are  waiting 
to  be  discovered.  We  conclude  this  chapter  with  some  questions  about  the  algorithm 
presented  here. 

Can  the  running  time  of  the  algorithm  for  learning  the  intersection  of  half-spaces 
be  improved  to  polynomial  in  ^  ?  Indeed  an  algorithm  that  is  fully  polynomial,  i.e., 
polynomial  in  k  as  well,  might  be  possible  for  non-concentrated  distributions.  I  do 
not  know  any  hardness  reduction  that  would  make  this  unlikely. 

Can  we  learn  the  intersection  of  k  hyperplanes  when  the  examples  are  drawn  from 
the  uniform  distribution  on  the  vertices  of  a  hypercube?  This  does  not  induce  a  non- 
concentrated  distribution  on  the  sphere  and  so  the  methods  in  this  chapter  cannot 
be  applied  directly.  One  reason  that  the  problem  is  interesting  for  this  distribution 
is  that  it  includes  as  a  special  case  the  problem  of  learning  DNF-formulae. 


Chapter  4 

Reducing  Dimensionality  by 
Random  Projection  II 


We  use  random  projection  to  quickly  approximate  the  eigenspace  of  a  matrix  and 
apply  it  to  speeding  up  the  information  retrieval  technique  known  as  Latent  Semantic 
Indexing, 


4.1  Introduction:  Information  Retrieval 

The  complexity  of  information  retrieval  is  best  illustrated  by  the  two  nasty  classical 
problems  of  synonymy  (missing  documents  with  references  to  “automobile”  when 
querying  on  “car”)  and  polysemy  (retrieving  documents  about  the  Internet  when 
querying  on  “surfing”).  One  possible  approach  to  dealing  with  these  two  problems 
would  be  to  represent  documents  (and  queries)  not  by  terms  (as  in  conventional 
vector-based  methods),  but  by  the  underlying  (latent ^  hidden)  ^^concepts^^  referred  to 
by  the  terms.  This  hidden  structure  is  not  a  fixed  many-to-many  mapping  between 
terms  and  concepts,  but  depends  critically  on  the  corpus  (document  collection)  in 
hand,  and  the  term  correlations  it  embodies. 

Latent  Semantic  Indexing  (LSI)  [19]  is  an  information  retrieval  method  which 
attempts  to  capture  this  hidden  structure  by  using  techniques  from  linear  algebra. 
Vectors  representing  the  documents  are  projected  in  a  new,  low-dimensional  space 
obtained  by  singular  value  decomposition  of  the  term-document  matrix  A.  This  low¬ 
dimensional  space  is  spanned  by  the  eigenvectors  of  A^A  that  correspond  to  the  few 
largest  eigenvalues  —  and  thus,  presumably,  to  the  few  most  striking  correlations 
between  terms  (see  Section  4.1.1  for  a  brief  description  of  the  technique).  Queries 
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are  also  projected  and  processed  in  this  low-dimensional  space.  This  results  not 
only  in  great  savings  in  storage  and  query  time  (at  the  expense  of  some  considerable 
preprocessing),  but  also,  according  to  empirical  evidence  reported  in  the  literature,  to 
improved  information  retrieval  [10,  22,  23].  Indeed,  it  has  been  repeatedly  reported 
that  LSI  outperforms,  with  regard  to  precision  and  recall  in  standard  collections  and 
query  workloads,  more  conventional  vector-based  methods. 

There  is  very  little  in  the  literature  in  the  way  of  a  mathematical  theory  that 
predicts  this  improved  performance.  An  interesting  mathematical  fact  due  to  Eckart 
and  Young  (stated  below  as  Theorem  8)  which  is  often  cited  as  an  explanation  of 
the  improved  performance  of  LSI  states,  informally,  that  LSI  retains  as  much  as 
possible  the  relative  position  of  the  document  vectors.  This,  however,  may  only 
provide  an  explanation  of  why  LSI  does  not  deteriorate  too  much  in  performance  over 
conventional  vector-space  methods;  it  fails  to  justify  the  observed  improvement. 

This  is  a  first  attempt  at  using  mathematical  techniques  to  rigorously  explain 
the  improved  performance  of  LSI  (Section  4.2  starts  with  a  brief  comparison  with 
previous  uses  of  probabilistic  techniques  in  information  retrieval).  Since  LSI  seems  to 
exploit  and  reveal  the  statistical  properties  of  a  corpus,  we  must  start  with  a  rigorous 
probabilistic  model  of  the  corpus  (that  is  to  say,  a  mathematical  model  of  how  corpora 
are  generated);  we  do  this  in  Section  4.2.  Briefly,  we  model  topics  as  probability 
distributions  on  terms.  A  document  is  then  a  probability  distribution  that  is  the 
convex  combination  of  a  small  number  of  topics.  We  also  include  in  our  framework 
style  of  authorship,  which  we  model  by  a  stochastic  matrix  that  modifies  the  term 
distribution.  A  corpus  is  then  a  collection  of  documents  obtained  by  repeatedly 
sampling  a  probability  distribution  on  combinations  of  topics  and  styles. 

Once  we  have  a  corpus  model,  we  would  like  to  determine  under  what  conditions 
LSI  results  in  enhanced  retrieval  properties.  We  would  like  to  prove  a  theorem  stat¬ 
ing  essentially  that  if  the  corpus  is  a  reasonably  focused  collection  of  meaningfully 
correlated  documents,  then  LSI  does  well.  The  problem  is  to  define  these  terms  so 
that  (1)  there  is  a  reasonably  close  correspondence  with  what  they  mean  intuitively 
and  in  practice,  and  (2)  the  theorem  can  be  proved.  In  Section  4.3  we  prove  results 
that,  although  not  totally  comprehensive  and  general,  definitely  point  to  this  direc¬ 
tion.  In  particular,  we  show  that  in  the  special  case  in  which  (a)  thei'e  is  no  style 
modifier;  (b)  each  document  is  on  a  single  topic;  and  (c)  the  terms  are  partitioned 
among  the  topics  so  that  each  topic  distribution  has  high  probability  on  its  own  terms, 
and  low  probability  on  all  others;  then  LSI,  projecting  to  a  subspace  of  dimension 
equal  to  the  number  of  topics,  will  discover  these  topics  exactly,  with  high  probability 
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(Theorem  9). 

In  Section  4.4  we  point  out  an  interesting  fact;  if  we  project  the  term-document 
matrix  on  a  completely  random  low-dimensional  subspace,  then  with  high  probability 
we  have  a  distance-preservation  property  akin  to  that  enjoyed  by  LSI.  This  ran¬ 
dom  projection  idea  may  yield  an  interesting  improvement  on  LSI:  we  can  perform 
the  LSI  precomputation  not  on  the  original  term-document  matrix,  but  on  a  low¬ 
dimensional  projection,  at  great  computational  savings  and  no  great  loss  of  accuracy 
(Theorem  11). 

This  last  result  can  be  seen  as  an  alternative  to  (and  a  justification  of)  sampling 
in  LSI.  Reports  on  LSI  experiments  in  the  literature  seem  to  suggest  that  LSI  is 
often  done  not  on  the  entire  corpus,  but  on  a  randomly  selected  subcorpus  (both 
terms  and  documents  may  be  sampled,  although  it  appears  that  most  often  docu¬ 
ments  are).  There  is  very  little  non-empirical  evidence  of  the  accuracy  of  such  an 
approach.  Our  result  suggests  a  different  and  more  elaborate  (and  computationally 
intensive)  approach  —  projection  on  a  random  low-dimensional  subspace  —  which 
can  be  rigorously  proved  to  be  accurate.  We  supplement  several  of  our  theorems  with 
experiments  on  corpora  derived  from  our  statistical  model. 

4.1.1  A  review  of  LSI  in  information  retrieval 

A  corpus  is  a  collection  of  documents.  Each  document  is  a  collection  of  terms  from 
a  universe  of  n  terms.  Each  document  can  thus  be  represented  as  a  vector  in  3?” 
where  each  axis  represents  a  term.  The  coordinate  represents  some  function  of  the 
number  of  times  the  term  occurs  in  the  document.  This  is  the  standard  vector- 
space  representation  of  documents.  There  are  several  candidates  for  the  right  function 
to  be  used  here;  we  assume  that  it  is  the  relative  frequency  of  the  term  (number  of 
times  the  term  occurs/total  number  of  terms  in  the  document). 

Let  A  be  an  n  X  m  matrix  whose  rows  represent  terms  and  columns  represent 
documents.  Let  the  rank  of  A  be  r.  Let  the  singular  values  of  A  be  Ci  >  <72  >  . . .  >  <7^ 
(not  necessarily  distinct),  i.e.,  af,  <7|, . . .  <7^  are  the  eigenvalues  of  AA^.  The  singular 
value  decomposition  of  A  expresses  A  as  the  product  of  three  matrices  A  =  UDV^ , 
where  D  =  diag((7i, . . . ,  <7^)  is  an  r  x  r  matrix,  U  =  (ui, . . . ,  u^)  is  an  n  x  r  matrix 
whose  columns  are  orthonormal,  and  V  =  (ui,...,t;r)  is  an  m  x  r  matrix  which  is 
also  column-orthonormal. 

LSI  works  by  omitting  all  but  the  k  largest  singular  values  in  the  above  decompo¬ 
sition,  for  some  appropriate  k",  here  k  is  the  dimension  of  the  low-dimensional  space 
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alluded  to  in  the  informal  description  of  Section  4.1.  It  should  be  small  enough  to 
enable  fast  retrieval,  and  large  enough  to  adequately  capture  the  structure  of  the 
corpus.  Let  Dk  =  diag((7i, . . . ,  cr/,),  Uk  =  (ui,...,  Uk)  and  Vk  =  (v^, . . .  ,Vk).  Then 

Ak  =  UkD,V^ 

is  a  matrix  of  rank  k,  which  is  our  approximation  of  A.  The  rows  of  VkDk  above 
are  then  used  to  represent  the  documents.  In  other  words,  the  column  vectors  of  A 
(documents)  are  projected  to  the  ^-dimensional  space  spanned  by  the  column  vectors 
of  Uk]  we  sometimes  call  this  space  the  LSI  space  of  A. 

How  good  is  this  approximation?  The  following  well-known  theorem  gives  us  some 
idea. 

Theorem  8  (Eckart  and  Young,  see  [32].)  Among  all  n  x  m  matrices  C  of  rank  at 
most  k,  Ak  is  the  one  that  minimizes  ||H  —  C\f  —  ~  . 

Therefore,  LSI  preserves  (to  the  extent  possible)  the  relative  distances  (and  hence, 
presumably,  the  retrieval  capabilities)  in  the  term-document  matrix  while  projecting 
it  to  a  lower-dimensional  space.  It  remains  to  be  seen  in  what  way  it  improves  these 
retrieval  capabilities. 


4.2  The  Probabilistic  Corpus  Model 

There  are  many  useful  formal  models  of  IR  in  the  literature,  and  probability  plays 
a  major  role  in  many  of  them  —  see  for  instance  the  surveys  and  comparisons  in 
[30,  53,  56].  The  approach  in  this  body  of  work  is  to  formulate  information  retrieval  as 
a  problem  of  learning  the  concept  of  “relevance”  that  relates  documents  and  queries. 
The  corpus  and  its  correlations  plays  no  central  role.  In  contrast,  our  focus  is  on  the 
probabilistic  properties  of  the  corpus. 

Since  LSI  exploits  and  brings  out  the  structure  of  the  corpus  it  will  fare  well 
in  a  meaningful  collection  of  strongly  correlated  documents,  and  will  produce  noise 
in  a  random  set  of  unrelated  documents.  In  order  to  study  the  dependence  of  the 
performance  of  LSI  on  the  statistical  properties  of  the  corpus,  we  must  start  with  a 
probabilistic  model  of  a  corpus.  We  state  now  our  basic  probabilistic  model,  which 
we  will  work  with  for  much  of  this  chapter.  In  Section  4.5.1  we  will  extend  this  to  a 
more  general  graph-theoretic  model  in  which  the  conductances  of  subsets  of  vertices 
will  play  a  role. 
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Let  the  universe  of  all  terms  be  U.  A  topic  is  a  probability  distribution  on  U.  A 
meaningful  topic  is  very  different  from  the  uniform  distribution  on  U,  and  is  concen¬ 
trated  on  terms  that  might  be  used  to  talk  about  a  particular  subject.  For  example, 
the  topic  of  “space  travel”  might  favor  the  terms  “galaxy”  and  “starship”,  while  rarely 
mentioning  “misery”  or  “spider”.  A  possible  criticism  against  this  model  is  that  it 
does  not  take  into  account  correlations  of  terms  within  the  same  topic  (for  example, 
a  document  on  the  topic  “Internet  ”  is  much  more  likely  to  contain  the  term  “search” 
if  it  also  contains  the  term  “engine”). 

The  structure  of  documents  is  also  heavily  affected  by  authorship  style.  We  model 
style  as  a  |f7|  x  |f7|  stochastic  matrix  (a  matrix  with  nonnegative  entries  and  row 
sums  equal  to  1),  denoting  the  way  whereby  style  modifies  the  frequency  of  terms. 
For  example,  a  “formal”  style  may  map  “car”  often  to  “automobile”  and  “vehicle,” 
and  seldom  to  “car”  —  and  almost  never  to  “wheels.”  Admittedly,  this  is  not  a 
comprehensive  treatment  of  style;  for  example,  it  makes  the  assumption  -  not  always 
valid  -  that  this  influence  is  independent  of  the  underlying  topic. 

A  corpus  model C  is  a  quadruple  C  =  {U^T^S,  D)^  where  U  is  the  universe  of  terms, 
7^  is  a  set  of  topics,  and  S  a  set  of  styles.  'T  x  S  x  Z"*",  where  by  T  we  denote  the  set 
of  all  convex  combinations  of  topics  in  7”,  by  S  the  set  of  all  convex  combinations  of 
styles  in  S,  and  by  Z"*"  the  set  of  positive  integers  (the  integers  represent  the  lengths  of 
documents).  That  is,  a  corpus  model  is  a  probability  distribution  on  topic  combina¬ 
tions  (intuitively,  favoring  combinations  of  a  few  related  topics),  style  combinations, 
and  document  lengths  (total  number  of  term  occurrences  in  a  document). 

A  document  is  generated  from  a  corpus  model  C  —  {U.,T,S,  D)  through  the 
following  two-step  sampling  process.  In  the  first  step,  a  convex  combination  of  topics 
T  from  T,  a  convex  combination  of  styles  S  from  <5,  and  a  positive  integer  £  are 
sampled  according  to  distribution  D.  Then  terms  are  sampled  £  times  to  form  a 
document,  each  time  according  to  distribution  TS.  A  corpus  of  size  m  is  a  collection 
of  m  documents  generated  from  C  by  repeating  this  two-step  sampling  process  m 
times. 


4.3  An  attempt  to  explain  LSI’s  success 

We  now  establish  a  result  based  on  our  model  that  provides  some  intuition  for  LSI’s 
empirical  success.  We  begin  with  some  tools  from  spectral  analysis  in  Section  4.3.1; 
following  our  analysis  in  Section  4.3.2,  we  give  some  experimental  results  using  our 
synthetic  model  in  support  of  this  analysis.  More  extensive  experimentation  using 
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large  corpora  of  “real”  documents  further  supports  this  analysis  (see  Section  4.5). 

4.3.1  Tools 

The  following  lemma  formalizes  the  intuition  that  if  the  k  largest  singular  values 
of  a  matrix  A  are  well-separated  from  the  remaining  singular  values  then  the  sub¬ 
space  spanned  by  the  corresponding  singular  vectors  is  preserved  well  when  a  small 
perturbation  is  added  to  A. 

Lemma  7  Let  A  be  an  n  x  m  matrix  of  rank  r  with  singular  value  decomposition 

A  =  UDV^, 

where  D  =  diag(cri, . . . ,  <7^).  Suppose  that,  for  some  k,  I  <  k  <  r,  Gklcrk+i  >  ca^jak 
for  sufficiently  large  constant  c.  Let  F  be  an  arbitrary  n  x  m  matrix  with  11^112  <  e, 
where  e  is  a  sufficiently  small  positive  constant.  Let  A'  =  A  F  and  let  U'D'V'^ 
be  its  singular-value  decomposition.  Let  Uk  and  be  n  x  k  matrices  consisting  of 
the  first  k  columns  of  U  and  U'  respectively.  Then,  U'k  =  UkR  +  G  for  some  k  x  k 
orthonormal  matrix  R  and  some  n  x  k  matrix  G  with  ||G||2  <  0(e). 

The  proof  of  this  lemma,  given  in  the  appendix,  relies  on  a  theorem  of  Stewart  [33] 
about  perturbing  a  symmetric  matrix. 

4.3.2  Analysis  of  LSI 

We  now  show  that  in  a  restricted  version  of  our  probabilistic  model,  LSI  brings 
together  documents  on  the  same  topic  while  keeping  apart  documents  on  different 
topics.  Let  C  =  (U,T,  D)  be  a  corpus  model.  We  call  C  pure  if  each  document  talks 
only  about  a  single  topic.  We  call  C  t-separable,  where  0  <  c  <  1,  if  a  set  of  terms 
Ut  is  associated  with  each  topic  T  G  T  so  that  (1)  Ut  are  mutually  disjoint  and 
(2)  for  each  T,  the  total  probability  T  assigns  to  the  terms  in  Ut  is  at  least  1  —  e. 
We  call  Ut  the  primary  set  of  terms  of  topic  T.  The  assumption  that  a  corpus  is 
e-separable  for  some  small  value  of  e  is  more  realistic  if  the  documents  are  assumed 
to  be  preprocessed  to  eliminate  commonly-occurring  stop-words. 

Let  C  be  a  pure  corpus  model  and  let  k  —  |T|  denote  the  number  of  topics  in 
C.  Since  C  is  pure,  each  document  generated  from  C  is  in  fact  generated  from  some 
single  topic  T :  we  say  that  the  document  belongs  to  the  topic  T.  Let  C  be  a  corpus 
generated  from  C  and,  for  each  document  d  £  C ,  let  Vd  denote  the  vector  assigned  to 
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d  by  the  rank-A;  LSI  performed  on  C.  We  say  that  the  rank-A;  LSI  is  S-skewed  on  the 
corpus  instance  C  if,  for  each  pair  of  documents  d  and  d',  Vd-Vd'  <  <^||yd||||V(i'||  if  d  and 
d'  belong  to  different  topics  and  Vd  ■  Vd'  >  I  —  (^||yrf||||u<i'||  if  they  belong  to  the  same 
topic.  Informally,  the  rank-A;  LSI  is  d-skewed  on  a  corpus  (for  small  d),  if  it  assigns 
nearly  orthogonal  vectors  to  two  documents  from  different  topics  and  nearly  parallel 
vectors  to  two  documents  from  a  single  topic:  LSI  does  a  particularly  good  job  of 
classifying  documents  when  applied  to  such  a  corpus.  The  following  theorem  states 
that  a  large  enough  corpus  (specifically,  when  the  number  of  documents  is  greater 
than  the  number  of  terms)  generated  from  our  restricted  corpus  model  indeed  has 
this  nice  property  with  high  probability. 

Theorem  9  Let  C  be  a  pure  and  e-separable  corpus  model  with  k  topics  such  that  the 
probability  each  topic  assigns  to  each  term  is  at  most  t,  where  t  >  0  is  a  sufficiently 
small  constant.  Let  C  be  a  corpus  of  m  documents  generated  from  C.  Then,  the 
rank-k  LSI  is  0{e)-skewed  on  C  with  probability  1  — 

Proof.  Let  Ci  denote  the  subset  of  the  generated  corpus  C  consisting  of  docu¬ 
ments  belonging  to  topic  T,  1  <  z  <  A:.  To  see  the  main  idea,  let  us  first  assume  that 
e  =  0.  Then,  each  document  of  Ci  consists  only  of  terms  in  Ui,  the  primary  set  of 
terms  associated  with  topic  T,.  Thus,  the  term-document  matrix  A  representing  cor¬ 
pus  C  consists  of  blocks  Bi,  1  <  i  <  k:  the  rows  of  Bi  correspond  to  terms  in  Ui  and 
columns  of  Bi  correspond  to  documents  in  Cp,  the  entire  matrix  A  can  have  non-zero 
entries  in  these  rows  and  columns  only  within  Bi.  Therefore,  A^A  is  block-diagonal 
with  blocks  BjBi,  \  <i  <k.  Now  focus  on  a  particular  block  Bf  Bi  and  let  and 
A'  denote  the  largest  and  the  second  largest  eigenvalues  of  Bf  Bi.  Intuitively,  the  ma¬ 
trix  BfBi  is  essentially  the  adjacency  matrix  of  a  random  bipartite  multigraph  and 
then,  from  the  standard  theory  of  spectra  of  graphs[18],  we  have  that  A^/Aj  0  with 
probability  1  as  r  — >■  0  and  \Ci\  — >■  oo.  Below  we  give  a  formal  justification  of  this 
by  showing  that  a  quantity  that  captures  this  property,  the  conductance  [36]  (equiva¬ 
lently,  expansion)  of  Bf  Bi  is  high.  The  conductance  of  an  undirected  edge-weighted 
graph  G  =  (V,  E)  is 

■5CV  min{|5|,|5|} 

Let  . .  ,x^  be  random  documents  picked  from  the  topic  T*.  Then  we  will 

show  that  the  conductance  is  fl(i^),  where  IT,!  is  the  number  of  terms  in  the  topic 
Tj.  Let  G  be  the  graph  induced  by  the  adjacency  matrix  BfBi.  For  any  subset  S  of 
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the  vertices  (documents), 

i€S,jes  i^sjes 

ies 

Assume  w.l.o.g.  that  |5|  <  |5|.  Let  Ps  be  the  probability  of  the  term  in  Ti.  Then 
we  can  estimate,  for  each  term,  ^  t^i^{Ps/‘^,Ps  —  e}  with  probability  at  least 

1  —  ^  using  the  independence  of  the  x^s  via  a  simple  application  of  Chernoff-Hoeffding 
bound  [35].  Using  this  we  lower  bound  the  weight  of  the  cut  {S,  S): 

•  (51^0  >  I]ar;(min{p,/2,p,  -  e}) 

«G5  j^s 

which  is  with  high  probability  by  a  second  application  of  the  Chernoff-Hoeffding 

bound.  The  desired  bound  on  the  conductance  follows  from  this. 

Thus,  if  the  sample  size  m  =  IC]  is  sufficiently  large,  and  the  maximum  term 
probability  r  is  sufficiently  small  (note  this  implies  that  the  size  of  the  primary  set 
of  terms  for  each  topic  is  sufficiently  large),  the  k  largest  eigenvalues  of  A  are  A*-, 
1  <  f  <  A:,  with  high  probability.  Suppose  now  that  our  sample  C  indeed  enjoys  this 
property.  Let  Ui  denote  the  eigenvector  of  Bj Bi  corresponding  to  eigenvalue  A,  (in 
the  space  where  coordinates  are  indexed  by  the  terms  in  Ti)  and  let  u,  be  its  extension 
to  the  full  term  space,  obtained  by  padding  zero  entries  for  terms  not  in  T.  Then,  the 
A;-dimensional  LSI-space  for  corpus  C  is  spanned  by  the  mutually  orthogonal  vectors 
Ui,  1  <  i  <  k.  When  a  vector  vj,  representing  a  document  d  €  Ci  is  projected  into 
this  space,  the  projection  is  a  scalar  multiple  of  u,-,  because  vj,  is  orthogonal  to  Uj  for 
every  j  ^  i. 

Caveat:  Although  the  above  argument  might  seem  to  support  the  spec¬ 
ulation  that  each  basis  vector  of  the  LSI  space  found  by  LSI  corresponds 
to  one  topic,  this  is  not  necessarily  the  case,  at  least  in  our  model.  If 
the  eigenvalues  Aj,  1  <  i  <  fc,  in  the  above  analysis  are  all  distinct,  then 
LSI  indeed  finds  n,,  1  <  f  <  A;,  as  the  basis  vectors.  However,  if  some 
of  the  eigenvalues  are  identical,  then  LSI  may  find  a  different  set  of  basis 
vectors.  Note  that  the  argument  in  the  previous  paragraph  is  not  relying 
on  LSI  finding  exactly  the  eigenvectors  n*,  1  <  i  <  A;;  it  relies  only  on  LSI 
identifying  the  subspace  spanned  by  those  vectors.  This  observation  may 
sound  an  inessential  side  note  because  in  our  probabilistic  corpus  genera¬ 
tion  model,  the  probability  that  two  of  the  eigenvalues  A*-,  Xj  are  identical 
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goes  to  zero.  In  the  more  general  case  e  >  0  we  consider  below,  however, 
we  may  not  expect  LSI  to  identify  individual  eigenvectors  corresponding 
to  single  topics,  even  if  the  eigenvalues  are  all  distinct. 


When  e  >  0,  the  term-document  matrix  A  can  be  written  as  A  =  B  +  F,  where  B 
consists  of  blocks  Bi  as  above  and  F  is  a  matrix  with  small  ||L||2-norm  (not  exceeding  e 
by  much,  with  high  probability).  As  observed  in  the  above  analysis  for  the  case  e  =  0, 
the  invariant  subspace  Wk  of  B^B  corresponding  to  its  largest  k  eigenvalues  is  an  ideal 
representation  space  for  representing  documents  according  to  their  topics.  Our  hope 
is  that  the  small  perturbation  F  does  not  prevent  LSI  from  identifying  Wk  with  small 
errors.  This  is  where  we  apply  Lemma  7.  Let  Wj^  denote  the  A:-dimensional  space  the 
rank-A:  LSI  identifies.  The  e-separability  of  the  corpus  model  implies  that  the  two- 
norm  of  the  perturbation  to  the  document-term  matrix  is  0(e)  and,  therefore  by  the 
lemma,  the  two-norm  of  the  difference  between  the  matrix  representations  of  Wk  and 
is  0(e).  Since  Wj.  is  a  small  perturbation  of  ILfc,  projecting  a  vector  representing 
a  document  in  Ci  into  Wf.  yields  a  vector  close,  in  its  direction,  to  Ui  (the  dominating 
eigenvector  of  Bj Bi).  Therefore,  the  LSI  representations  of  two  documents  are  almost 
in  the  same  direction  if  they  belong  to  the  same  topic  and  are  nearly  orthogonal  if 
they  belong  to  different  topics.  A  quantitative  analysis  (Lemma  10)  shows  that  the 
rank-fc  LSI  is  indeed  0(e)-skewed  on  C  with  high  probability.  □ 


4.3.3  Experiments 

Even  though  Theorem  9  gives  an  asymptotic  result  and  only  claims  that  the  prob¬ 
ability  approaches  1  as  the  size  parameters  grow,  the  phenomenon  it  indicates  can 
be  observed  in  corpora  of  modest  sizes,  as  is  seen  in  the  following  experiment.  We 
generated  1000  documents  (each  50  to  100  terms  long)  from  a  corpus  model  with  2000 
terms  and  20  topics.  Each  topic  is  assigned  a  disjoint  set  of  100  terms  as  its  primary 
set.  The  probability  distribution  for  each  topic  is  such  that  0.95  of  its  probability 
density  is  equally  distributed  among  terms  from  the  primary  set,  and  the  remain¬ 
ing  0.05  is  equally  distributed  among  all  the  2000  terms.  Thus  this  corpus  model  is 
0.05-separable.  We  measured  the  angle  between  all  pairs  of  documents  in  the  original 
space  and  in  the  rank  20  LSI  space.  The  following  is  a  typical  result.  Call  a  pair  of 
documents  intra-topic  if  the  two  documents  are  generated  from  the  same  topic  and 
inter-topic  otherwise. 
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intra-topic 

inter-topic 

min 

max 

average 

std 

min 

max 

average 

std 

original  space 

0.801 

1.39 

1.09 

0.079 

1.49 

1.57 

1.57 

0.00791 

LSI  space 

0 

0.312 

0.0177 

0.0374 

0.101 

1.57 

1.55 

0.153 

Here,  angles  are  measured  in  radians.  It  can  be  seen  that  the  angles  of  intra-topic 
pairs  are  dramatically  reduced  in  the  LSI  space.  Although  the  minimum  inter-topic 
angle  is  rather  small,  indicating  that  some  inter-topic  pairs  can  be  close  enough  to  be 
confused,  the  average  and  the  standard  deviation  show  that  such  pairs  are  extremely 
rare.  Similar  results  are  obtained  from  ten  repeated  trials.  Results  from  experiments 
with  different  size-parameters  are  also  similar  in  spirit. 

In  this  and  the  other  experiments  reported  here,  we  used  SVDPACKC  [11]  for 
singular  value  decomposition. 


4.4  Fast  LSI  via  pandom  projection 

A  lemma  of  Johnson  and  Lindenstrauss  shows  that  if  points  in  a  vector  space  are  pro¬ 
jected  to  a  random  subspace  of  suitably  high  dimension,  then  the  distances  between 
the  points  are  approximately  preserved.  Although  such  a  random  projection  can  be 
used  to  reduce  the  dimension  of  the  document  space,  it  does  not  bring  together  se¬ 
mantically  related  documents.  LSI  on  the  other  hand  seems  to  achieve  the  latter,  but 
its  computation  time  is  a  bottleneck.  This  naturally  suggests  the  following  approach: 

1.  Apply  a  random  projection  to  the  initial  corpus  to  /  dimensions,  for  some  small 
/  >  A;,  to  obtain,  with' high  probability,  a  much  smaller  representation,  which  is 
still  very  close  (in  terms  of  distances  and  angles)  to  the  original  corpus. 

2.  Apply  rank  k  LSI  to  the  documents  in  the  projected  space  to  get  the  final  result. 

We  prove  that  the  above  approach  is  good  in  the  sense  that  the  the  final  representation 
is  very  close  to  what  we  would  get  by  directly  applying  LSI.  Another  way  to  view  this 
result  is  that  random  projection  gives  us  a  fast  way  to  approximate  the  eigenspace 
(eigenvalues,  eigenvectors)  of  a  matrix. 

We  first  state  the  Johnson-Lindenstrauss  lemma. 

Lemma  8  (Johnson  and  Lindenstrauss,  see  [28,  38].)  Let  v  ^  be  a  unit  vector, 
let  H  be  a  random  l-dimensional  subspace  through  the  origin,  and  let  the  random 
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variable  X  denote  the  square  of  the  length  of  the  projection  of  v  onto  H.  Suppose 
0  <  e  <  |,  and  24 log  n  <  I  <  y/n.  Then,  E[X]  =  Ijn,  and 

Pr{\X  -  l/n\  >  d/72)  < 

Using  the  above  lemma,  we  can  infer  that  with  high  probability,  all  pairwise  Eu¬ 
clidean  distances  are  approximately  maintained  under  projection  to  a  random  sub¬ 
space.  By  choosing  I  to  be  in  Lemma  8,  we  have  with  high  probability  that 

the  projected  vectors,  after  scaling  by  a  factor  -^^1//,  {u-},  satisfy 

ii^i  -  Vj\\2{l  -  e)  <  ||u'  -  t;'||2  <  ||ui  -  Vj\\2{l  -f  eps). 

Similarly  inner  products  are  also  preserved  approximately:  2u,-  •  Vj  =  v]  -f  — 
(uj  —  Vj)^.  So  the  projected  vectors  satisfy 

2u'  •  v'-  <  {vf  +  vj){l  +  e)  -  (vi  -  Vjf{\  -  e) 

Therefore,  v'-  ■  uj  <  Vi  •  Vj{l  —  e)  -f  e{vf  +  u|).  In  particular,  if  the  u^’s  are  all  of  length 
at  most  1,  then  any  inner  product  u,-  •  changes  by  at  most  2e. 

Consider  again  the  term-document  matrix  A  generated  by  our  corpus  model.  Let 
i?  be  a  random  column-orthonormal  matrix  with  n  rows  and  I  columns,  used  to 
project  A  down  to  an  /-dimensional  space.  Let  B  =  yJfRF A  be  the  matrix  after 
random  projection  and  scaling,  where, 

r 

A  =  Y^  GiUivJ 

i=l 

and 

i=l 

are  the  SVD’s  of  A  and  B  respectively. 

Theorem  10  Let  e  be  an  arbitrary  positive  constant.  If  I  >  for  a  sufficiently 
large  constant  c  then,  for  p  —  1, . . . ,  f 


Proof.  The  eigenvalue  of  B  can  be  written  as 

p—i  p—i 

Ap  =  max\y\^iv'^[B  -  a^bJ^^lB  -  ^  ajbj]v 
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Consider  the  above  expression  for  Ui, . . . ,  Vk,  the  first  k  eigenvectors  of  A.  For  the 
eigenvector  Vi  it  can  be  reduced  to 

j=l 

j=l 

j=l 

>  (1  -  e)af  - 

i=i 

Summing  this  up  for  i  =  1, ,  fc, 

s  >  ( 1  -  c)  ^  c.?  -  £  A]  J2{b,-v.  f 

i~l  i=l  j—1  i=l 

Since  the  vjs  are  orthogonal  and  the  6j’s  are  unit  vectors, 

i=l  j=i 

Hence 

Aj  >  maxy^vf  Bv^  >  ^[(1  -  e)^cr,f  -  ^ 

^  1=1  j=i 

□ 


Theorem  11 

WBnh^  >  {1  -  e)\\Ak\\2^ 

In  other  words  the  matrix  obtained  by  RP+LSI  recovers  most  of  the  matrix  ob¬ 
tained  by  direct  LSI. 

If  we  further  assume  that  only  the  top  k  eigenvalues  of  A^ A  are  dominant,  than 
we  can  prove  more. 

How  much  faster  is  the  two  step  method?  Let  T  be  an  n  x  m  matrix.  Then 
the  time  to  compute  LSI  is  0{rarB)  for  a  dense  matrix.  On  the  other  hand  if  A 
is  sparse,  this  goes  down  to  0{mnc)  where  c  is  the  (average)  number  of  non-zero 
entries  in  a  column  of  A,  i.e.  the  number  of  terms  in  a  document.  The  time  to 
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compute  the  random  projection  to  I  dimensions  is  0{mnl)  for  dense  matrices  and 
0{mcl)  for  sparse  matrices.  After  the  projection,  the  time  to  compute  LSI  is  0{mP). 
So  the  total  time  is  0{ml{l  +  c)).  To  obtain  an  e  approximation  we  need  I  to  be 
Thus  the  running  time  of  the  two-step  method  is  asymptotically  superior: 
(9(m(log^  n  -f  clog  n))  compared  to  0{mnc). 


4.5  Conclusion  and  further  work 

Recently,  in  personal  communication,  S.  Vathyanathan  and  D.  Modha  (at  IBM  Al- 
maden)  give  preliminary  reports  of  success  on  real-life  corpora  with  methods  involving 
RP  and  SVD, 

A  theoretician’s  first  reaction  to  an  unexpected  (positive  or  negative)  empirical 
phenomenon  is  to  understand  it  in  terms  of  mathematical  models  and  rigorously 
proved  theorems;  this  is  precisely  what  we  have  tried  to  do,  with  substantial  if  partial 
success.  What  we  have  been  able  to  prove  should  be  seen  as  a  mere  indication  of 
what  might  hold;  we  expect  the  true  positive  properties  of  LSI  to  go  far  beyond  the 
theorems  we  are  proving  here. 

There  are  several  specific  issues  to  be  pursued  here.  Two  of  them  are,  a  model 
where  documents  could  belong  to  several  topics,  and  one  where  term  occurrences  are 
not  independent.  Another  issue  is,  does  LSI  address  polysemy?  We  have  seen  some 
evidence  that  it  handles  synonymy. 

Theory  should  ideally  go  beyond  the  ex  post  facto  justification  of  methods  and 
explanation  of  positive  phenomena,  it  should  point  the  way  to  new  ways  of  exploiting 
them  and  improving  them.  Section  4.4,  in  which  we  propose  a  random  projection 
technique  as  a  way  of  speeding  up  LSI  (and  possibly  as  an  alternative  to  it),  is  an 
attempt  in  this  direction. 

4*5.1  A  more  general  model 

In  this  model  we  will  view  the  corpus  as  an  edge-weighted  graph.  There  is  a  vertex 
in  the  graph  for  each  document.  The  weight  of  an  edge  between  two  documents  u 
and  V  denotes  the  similarity  between  the  documents,  with  higher  weight  indicating 
greater  similarity,  e.g.  u  •  v.  Documents  that  constitute  a  topic  form  an  induced 
subgraph  with  high  conductance  [36]  in  this  graph.  This  addresses  the  intuitive  idea 
that  two  documents  could  be  related  through  other  documents  even  if  they  do  not 
have  many  terms  in  common  directly.  Also  notice  that,  in  general,  a  single  document 
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could  belong  to  many  different  topics,  which  calls  for  a  substantial  extension  of  the 
theory  and  techniques  developed  here. 

We  now  extend  our  analysis  of  LSI  to  this  more  general  setting.  Assume  that 
each  document  belongs  to  a  single  topic.  The  graph  can  then  be  partitioned  into 
the  topics,  so  that  the  subgraph  induced  by  each  topic  has  high  conductance.  The 
adjacency  matrix  A^A  can  thus  be  written  as  A'^A'  +  F  with  the  following  properties; 

•  A'^A'  is  a  block  diagonal  matrix 

•  Each  block  corresponds  to  a  topic 

•  The  first  two  eigenvalues  of  each  block  are  well-separated. 

•  The  matrix  E  is  a  perturbation  of  low  norm. 

Let  there  be  k  topics  in  the  corpus.  Applying  lemma  7  we  see  that  a  rank  k 
LSI  will  closely  approximate  the  eigenspace  spanned  by  the  first  k  eigenvectors  of 
A'^ A'.  Assuming  that  the  first  eigenvalues  of  the  blocks  are  all  greater  than  any  of 
the  second  eigenvalues,  this  implies  that  LSI  will  separate  the  documents  according 
to  their  topics. 


4.6  Appendix:  Proof  of  Lemma  7 

In  the  following  version  of  Lemma  7,  we  take  some  specific  values  for  some  constants 
to  facilitate  the  proof;  note  that  the  choice  of  those  values  are  arbitrary  to  a  large 
extent. 

Lemma  9  Let  A  be  an  n  x  m  matrix  of  rank  r  with  singular  value  decomposition 

A  =  UDV^, 

where  D  =  diag((7i, . . . ,  cr^).  Suppose  that,  for  some  k,  1  <  k  <  r,  21/20  >  (Ti  > 
...  >  >  19/20  and  1/20  >  >  ...  >  cr^.  Let  F  be  an  arbitrary  n  x  m 

matrix  with  ||F||2  <  e  <  1/20.  Let  A'  =  AA  F  and  let  U'D'V'^  be  its  singular-value 
decomposition.  Let  Uk  and  Uj.  be  n  x  k  matrices  consisting  of  the  first  k  columns  of 
U  and  U'  respectively.  Then,  (/(.  =  UkR  +  G  for  some  k  x  k  orthonormal  matrix  R 
and  some  n  x  k  matrix  G  with  ||(T||2  <  9e. 

The  proof  of  this  lemma  relies  on  a  theorem  of  Stewart  [33]  about  perturbing  a 
symmetric  matrix. 
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Theorem  12  Suppose  B  and  B  E  are  n  x  n  symmetric  matrices  and 


Q  —  [  Qi  Q2  ] 

k  n  —  k 


is  an  n  X  n  orthogonal  matrix  such  that  range{Qi)  is  an  invariant  subspace  for  B. 
Partition  the  matrices  Q^BQ  and  EQ  as  follows,  where  Bu  and  Eu  are  k  x  k 
matrices: 


Q^BQ  = 
Q^EQ  = 


Bn  0 
0  B22 

En  Ei2 
E21  E22 


If 


8=\r 


tJ'r, 


\E- 


11  2 


ll-E'22112  >  0, 


where  Xmin  the  smallest  eigenvalue  of  Bn  and  Umax  Is  the  largest  eigenvalue  of  B22, 
and  II -£'12 II 2  "S  812  then  there  exists  an  in  —  k)  xk  real  matrix  P  such  that 


ll^'IU  <  ||I-E!.II2 


and  the  columns  of  Q\  =  {Qi  +  Q2P){I  +  P^P)  form  an  orthonormal  basis  for  a 
subspace  that  is  invariant  for  B  E. 


Proof,  of  Lemma  9.  We  apply  Theorem  12  to  5  =  AA^ ,  E  =  A'{A')'^  —  B.  We 
choose  the  block-diagonalizing  matrix  Q  in  the  theorem  to  be  U  followed  by  n  —  r 
zero-columns.  Thus,  when  we  write  Q  =  [Q1Q2],  Qi  -  Uk^  the  first  k  columns  of  U, 
and  Q2  consists  of  remaining  columns  of  U  followed  by  zero-columns.  Since  BU  is 
a  diagonal  matrix,  BQ  is  also  a  diagonal  matrix.  Let  EQ  be  decomposed  into 
blocks  Eij,  1  <  i,j  <  2,  as  in  Theorem  12.  To  apply  the  theorem,  we  need  to  bound 
ll^cilb-  We  do  this  simply  by  bounding  ||£'||2-  Since  E  =  {A-\-  E){A-\-  EY  —  AA^  = 
AF^'  -h  FA^  +  FF^,  we  have  ||£;||2  <  2||A||2||F||2  +  ||F||2'  <  2(21/20)c  +  < 

(43/20)e.  Therefore,  ||Fjj||2  <  (43/20)e,  1  <  i,j  <  2.  The  non-zero  eigenvalues  of 
B  are  crj,  ...,al.  Of  these,  cr^, . . .  ,<t|  >  361/400  and  . . .  ,<7^  <  1/400.  Hence 
=  Xmin  —  9max  —  ||Fii||2  —  IIF22II2  is  positive:  8  >  361/400  —  1/400  —  (43/10)e  > 
137/200.  Also  we  have  IIF12II2  <  (43/20)e  <  43/400  <  8/2  and  all  the  assumptions 
of  Theorem  12  are  satisfied.  It  follows  that  there  exists  an  ((n  —  k)  x  k  matrix  P 
satisfying  ||P||2  <  f  IIF21II2  <  7e  such  that 

Q'i  =  {Qi  +  Q2P}il  +  P^Pr'^^ 


(4.1) 
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forms  an  orthonormal  basis  for  a  subspace  that  is  invariant  for  B  +  E.  This  invariant 
subspace  corresponds  to  the  k  largest  singular  values  ol  A  +  F.  Therefore,  the  column 
vectors  of  f/^,  (the  first  k  eigenvectors  of  jB  +  £")  span  the  same  invariant  subspace  as 
that  spanned  by  the  column  vectors  of  Q\ .  In  other  words,  there  is  &  kxk  orthonormal 
matrix  R  such  that  Uj.  =  Q'^R. 

Since  lli^ilb  <  1,  llQilb  <  1,  and 

II-BII2  <  7e,  it  follows  from  (4.1)  that  Q[  =  QiA  H  for  some  H  with  \\H\\2  <  9e. 
Therefore,  Uj.  =  UkR  +  HR,  with  ||i//?||2  <  9e,  as  claimed.  □ 

The  following  lemma  is  also  used  in  the  proof  of  Theorem  9. 

Lemma  10  Let  U  €  be  a  matrix  with  orthonormal  columns  and  let  W  € 

be  a  matrix  with  ||14^  —  U\\2  <  e.  Let  u,v,w  €  be  vectors  such  that  ||t/^u||2  = 
||I7^u||2  =  ||tf^r«||2  =  I,  {U^u,U'^v)  =  1  and  {U^u,U'^w)  =  0.  Let  u',v',w'  €  i?”  be 
arbitrary  vectors  with  ||u  —  u^||2,  ||w  —  ~  ^  Then, 

(VFV,  ITV)  >  (1  -  4e)||lTV||2||lTV||2,  and 

(1TV,TTV)  <4e. 


Chapter  5 

Sampling  Lattice  Points 


When  is  the  volume  of  a  convex  polytope  in  close  to  the  number  of  lattice  points  in 
the  polytope?  We  show,  algorithmically,  that  if  the  polytope  contains  a  ball  of  radius 
Uy/log  m,  where  m  is  the  number  of  facets,  then  the  volume  approximates  the  number 
of  lattice  points  to  within  a  constant  factor. 


5.1  Introduction 

In  recent  years,  random  sampling  has  become  ubiquitous  in  its  applicability.  In  this 
chapter  we  return  to  a  classic  theme  of  random  sampling,  namely  uniform  generation 
and  approximate  counting.  The  connection  between  uniformly  generating  from  com¬ 
binatorial  sets  and  approximately  counting  them  was  observed  more  than  a  decade 
ago  [37].  Following  that,  the  Markov  Chain  Monte-Carlo  method  has  yielded  effi¬ 
cient  algorithms  based  on  random  walks  for  a  variety  of  generation  (and  counting) 
problems. 

Here  we  consider  the  problem  of  counting  approximately  the  number  of  lattice 
points  in  an  n— dimensional  poly  tope  of  the  form 

P  -  {x  e  :  Ax  <  6}, 

where  A  is  an  m  x  n  matrix  of  nonnegative  reals  and  6  is  an  m—  vector  of  nonnegative 
reals.  Letting  denote  the  set  of  integer  points  (points  with  all  integer  coordinates), 
we  are  interested  in  the  problem  of  estimating  |P  n  Z^|  given  A,  b.  Closely  related  to 
it  is  the  problem  of  sampling  nearly  uniformly  from  the  set  P  n  Z^  [37]. 

This  problem  includes  as  a  special  case  several  combinatorial  counting  problems 
that  have  been  studied  -  like  that  of  estimating  the  permanent  of  a  0-1  matrix  [36], 
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the  number  of  contingency  tables  [26],  solutions  to  knapsack  problems  [25]  etc..  It  is 
well-known  that  the  finding  |P  fl  Z"|  exactly  is  #  P-hard  [58].  It  is  also  easy  to  show 
that  it  is  NP-hard  to  estimate  |P  H  Z"|  to  any  polynomial  (in  input  length)  factor. 
For  completeness,  we  include  a  proof  of  this  folklore  result  here. 

An  approach  to  the  problem  is  to  reduce  it  to  the  tractable  problem  of  estimating 
the  volume  of  P  or  a  related  polytope  (see  [24]).  Intuitively,  it  is  easy  to  argue  that  if 
each  entry  of  b  is  sufficiently  large,  then  the  number  of  integer  points  in  the  polytope 
is  close  to  the  volume  of  the  polytope.  In  fact,  this  general  principle  can  be  dated 
back  to  Gauss,  and  has  been  the  subject  of  many  fascinating  studies  (see  e.g.  [27]). 

In  this  chapter,  we  prove  that  if  the  polytope  contains  a  ball  of  radius  f](n\/log  m), 
then  the  volume  of  P  approximates  |PnZ”[  well.  We  also  show  that  this  is  essentially 
tight.  The  proof  is  relatively  simple  and  is  algorithmic,  so  it  actually  yields  an 
algorithm  to  sample  nearly  uniformly  from  P  D  Z"  under  the  condition.  We  give 
a  simple  class  of  examples  to  show  that  the  bound  is  tight  to  within  constants. 

From  this  general  result,  several  interesting  special  cases  follow. 

One  interesting  case  is  that  of  sampling  uniformly  from  the  set  of  nonnegative 
integer  matrices  with  specified  row  and  column  sums  (called  “contingency  tables”). 
Specifically,  the  problem  is  the  following  ;  for  natural  numbers  m,  n,  we  are  given  “row 
sums”  Ti,  r2, . . .  Tm  and  column  sums  Ci,  C2, . . .  c„  which  are  all  nonnegative  integers 
with  =  Jlcj-  ^  sample  nearly  uniformly  from  the  integer  points 

in  the  polytope 

P  =  {x  G  R™"  :  ^  Xij  =  r,'  for  z  =  1, 2, . . .  ^  Xij  =  cj  for  j  =  1, 2  . . .  n}. 

j  i 

This  problem  arises  in  Statistics  and  Combinatorics  [20,  21].  Dyer,  Kannan  and 
Mount  [26]  gave  a  polynomial  time  algorithm  for  this  problem  provided  n  G  O(mn^) 
and  Cj  €  O(nm^).  Our  general  result  here  implies  the  earlier  result,  with  a  simple 
proof  (and  with  slightly  weaker  assumptions). 

A  second  special  case  is  that  of  sampling  nearly  uniformly  from  the  set  of  inte¬ 
gral  s  —  t  flows  in  a  network  G  =  {V,E).  We  show  that  if  each  edge  capacity  is 
at  least  |P['y/|P|,  we  can  do  the  sampling  in  polynomial  time  and  hence  also  solve 
approximately  the  problem  of  counting  the  number  of  integral  flows.  In  contrast,  we 
recapitulate  the  folklore  result  that  it  is  NP-hard  to  estimate  the  number  of  integral 
flows  when  the  capacities  are  all  1.  It  is  an  interesting  open  problem  to  reduce  this 
gap, 

A  third  special  case  is  the  6-matching  problem.  In  this  problem,  we  are  given  non¬ 
negative  integers  b{v)  associated  with  the  vertices  of  a  graph  G.  A  6-matching  is  an 
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edge- weighted  subgraph  of  G  whose  degree  at  vertex  v  is  at  most  b{v).  Optimization 
over  the  set  of  fe-matchings  of  a  graph  is  studied  e.g.  in  [34].  Here  we  show  that  if  all 
the  h(u)’s  satisfy  b[v)  >  \E\deg{v),  where  deg{v)  is  the  degree  of  vertex  v  in  G,  then 
we  can  sample  uniformly  from  the  set  of  6-matchings. 

Another  special  case  is  that  of  the  multidimensional  knapsack  problem.  Here  the 
polytope  P  is  of  the  form 

P  =  {x  ^  H"' :  Ax  <  b,0  <  Xi  <  d,  for  i  =  l,...,n}, 

where  A  is  a  nonnegative  integer  matrix  and  d  is  a  vector  of  “upper  bounds”.  Without 
loss  of  generality,  we  may  assume  that  djAij  <  bi  for  all  t,  j.  It  is  shown  in  [25]  that 
if  all  dj  >  then  there  is  a  polynomial  time  algorithm  to  count  approximately  the 
number  of  integer  points  in  P.  We  show  that  our  general  result  gives  a  polynomial 
time  algorithm  with  slightly  better  bounds. 


5.2  The  Sampling  Theorem 

We  call  a  point  in  R”  with  all  integer  coordinates  an  integer  point.  If  x  is  any  point, 
and  A  a  positive  real,  we  denote  by  C(x,A)  the  cube  of  side  2A  with  x  as  center. 
Suppose  A  is  an  m  X  n  matrix  of  reals  and  6  is  an  m  x  1  of  nonnegative  reals  and 

P  =  {a-  e  R"  :  Aa:  <  b}. 

Let  r  be  the  maximum  number  (over  integral  points  a;)  of  facets  of  P  that  intersect 
any  C{x,  1)  for  x  an  integral  point.  Let  Aj  denote  the  i  th  row  of  A.  Then  our  main 
result  can  be  summarized  as  follows. 

Theorem  13  For  any  polytope  P  satisfying  bi  G  fl(n-v/log  rjA^dj]  all  i,  there 
exists  a  polynomial  time  algorithm  for  nearly-uniformly  sampling  P  fl  Z". 

The  running  time  of  the  algorithms  will  be  inversely  proportional  to  the  desired 
accuracy.  The  rest  of  this  section  is  devoted  to  proving  this  theorem  by  constructing 
a  sampling  algorithm. 

Let  c  be  any  positive  real  to  be  specified  later  and  let  b'  be  a  m— vector  defined 
by 


bi  =  bi  +  {c+  y^21ogr)|Ai 

Let  P'  =  {x  :  Ax  <  b'}. 


62 


SAMPLING  LATTICE  POINTS 


Our  idea  will  be  to  pick  a  point  p  from  P'  from  a  probability  density  close  to  the 
uniform.  We  will  then  “round”  p  to  obtain  an  integer  point;  if  the  integer  point  is 
in  P,  we  accept,  otherwise,  we  reject  and  repeat.  We  will  use  a  natural  probabilistic 
rounding  procedure  which  we  describe  presently.  This  simple  rounding  procedure 
used  in  a  dilferent  context  by  Raghavan  and  Thompson  e.g.  [52],  is  the  main  new 
ingredient  here. 

For  p  €  R”,  we  define  a  vector- valued  random  variable  X{p)  by 


X{ph  = 


[PjJ  +  1  with  probability  pi  —  [pi\ 

\pi\  with  probability  1  —  p,  -)-  [p,J 


where  the  X(jp)i,i  =  1, 2, ...  n  are  chosen  independently. 


Theorem  14  Suppose  p  is  picked  from  a  probability  density  V  whose  variational 
distance  to  the  uniform  density  on  P'  is  at  most  e.  Then  for  any  x  G  PflZ”,  we  have 


1  -  2e-"'  -  £ 
vol(P') 


<  Pr(A'(p)  =  x)  < 


1 

vol{P') 


Proof.  First  we  establish  the  lower  bound.  The  idea  is  to  bound  the  probability 
of  picking  an  integer  point  x  in  terms  of  the  probability  of  picking  x  given  that  we 
pick  a  continuous  point  in  C{x^  1). 

Define  to  be  independent  identically  distributed  real  valued  random  vari¬ 

ables  each  distributed  according  to  the  density  function  1  —  |t|  on  the  real  interval 
t  G  [—1  -h  1].  In  the  calculations  below,  dp  is  an  infinitesimal  n-dimensional  volume 
and  dV  is  the  probability  of  picking  a  point  from  dp.  Then  for  any  a;  G  P  D  Z",  and 
p  G  R",  with  \pi  —  Xjj  <  1,  we  have  that 

n 

Prob.  density  (x  +  F  =  p)  =  JJ(1  —  \pi  ~  x^)- 

i~i 


where  \e'\  <  e. 


Also,  Pr(W(p)  =  a;|p)  =  ]ij(l  -  |pi  -  Xi|). 

i=l 

I  Pr{X{p)  =  x\p)dV  =  Jp  /  Pr(-A(p)  =  x\p)dp  -f  e', 
JpPP'  vol  P  JvGP' 


ol(P')  Jp^P' 

,  /  Prob.  density  {x  AY  —  p)dp  = 

JpeP' 

/  Prob.  density  (x  -f  Y  =  p)dp  —  /  Prob.  density  (x  AY  =  p)dp 

Jpec(x,i)  ^  Jc(x,i)\P'  ^ 


Now, 
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>l-J2Pr(AA^>b'-b,). 

i=l 

Now  for  a  fixed  i,  consider  the  random  variables 

AikVk- 

k—\ 

It  is  easy  to  see  that  E{Zj\Zj-i)  —  Zj-\.,  so  the  {Zj}  form  a  Martingale.  Also, 
\Zj  —  Zj^i  \  <  Aij.  So  applying  Azuma’s  inequality  [1],  we  get  that 

Pr(A.-F  >  h\  -  hi)  <  2e-^'/r. 

This  proves  the  lower  bound  in  the  Theorem. 

The  upper  bound  on  the  probability  follows  by 

/  Prob.  density  (x  -{-Y  =  p)dp  —  1. 

Jpec{x,i) 

We  show  that  each  a:  G  P  H  Z”  gets  picked  with  about  the  same  probability  by 
the  following  algorithm. 

Sampling  Algorithm 
Suppose  c  >  0  is  given. 

•  Pick  a  point  p  from  P'  with  c  =  ^In  according  to  a  probability  density  V 
whose  variational  distance  to  uniform  is  at  most  |. 

•  If  X{p)  is  in  P,  then  return  it;  otherwise  reject  and  repeat. 


It  remains  to  show  that  the  probability  of  rejection  is  not  too  high  assuming 
bounds  on  hi. 

Lemma  11  If  hi  >  Sn^log  ^\Ai\  for  all  i,  then  the  probability  of  acceptance  of  the 
sampling  algorithm  is  at  least 


Proof.  The  probability  of  acceptance  is 


E  rr{x{p) 

xePnZ^ 


,  ^  1  |Pnz”| 

-  4  vol(P')  ■ 


We  show  that 

IPnz'^l  >  ^voi(P') 

8 
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To  this  end,  suppose  we  pick  p  from  the  uniform  density  from 

P”  =  {x  :  AiX  <  bi  —  {c  +  y^21og  r)|/l,-|for  f  =  1, 2, . . .  m} 

and  let  x  =  X{p).  Then  by  the  same  argument  as  before,  with  probability  at  least 
X  will  be  in  P  n  Z”.  Also,  for  a  fixed  x  E  P  fl  Z”,  the  probability  that  we  get 
X{p)  =  a;  by  this  process  is  at  most  l/vol(P").  So, 

Thus,  \P  n  Z”|  >  lvol(P").  Now, 

v°l(P')  .  /y+^+y2§T)inV  <•  9 

vol(P")  -  ““  \h,  -  {c+V2i^)\Ai\)  - 

a 

For  algorithmic  purposes  we  can  weaken  the  lower  bound  to  be  that  each  compo¬ 
nent  of  b  is 

V  I  I  logn  ^ 

Note  that  we  assumed  as  hypothesis  that  6,:  E  fl(n\/log It  is  easy  to  see 
that  this  is  equivalent  to  saying  that  a  ball  of  radius  n(n-v/logr)  with  the  origin  as 
center  is  contained  in  P.  It  is  then  a  simple  matter  to  remove  the  restriction  that  the 
ball  have  the  origin  as  its  center. 

The  latest  algorithms  for  estimating  the  volumes  of  convex  sets  (or  sampling  from 
them)  are  quite  fast:  0*{n^)  [39].  This  follows  a  long  series  of  improvements  starting 
from  [24]  and  [47].  It  is  worth  mentioning  that  these  “continuous”  random  walks  are 
now  provably  much  faster  than  their  discrete  counterparts. 

The  arguments  used  in  proving  Theorem  13  also  prove 

Theorem  15  Let  P  he  a  polytope  in  with  m  facets  containing  a  ball  of  radius 
fl(n-v/log  mf.  Then  there  is  a  constant  c  such  that 

cjPn  Z"|  <  vol{P)  <  -|Pn  Z”|. 

5.3  Special  cases  of  the  theorem 

5.3.1  Contingency  tables 

The  problem  is  as  described  in  the  Introduction.  But  it  will  be  more  convenient  to  deal 
with  a  full  dimensional  polytope.  With  some  simple  manipulation,  it  is  easy  to  see 
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that  as  in  [26],  we  can  define  the  polytope  P  with  1  <  i  <  m  —  1  I  <  j  <  n  —  1 

as  the  variables  defined  by  the  following  constraints  : 

n  —  1 

i=i 

m— 1 

?=i 

m  — 1  n—1  m  — 1 

yy  ~  ~  ^n  ^  o. 

«=:1  j  =  l  ?=:1 

In  the  above,  as  well  as  in  the  rest  of  the  section,  i  will  run  through  1,2, ...  m  —  1 
and  j  will  run  through  1, 2, ...  n  —  1  unless  otherwise  specified. 

To  apply  Theorem  13,  we  will  reformulate  the  problem  with  the  substitution 

Vij  =  Xij  —  mn. 

Then  the  polytope  P  in  y-space  is  defined  by 

n—1 

Vij  <  ri  —  mn{n  —  1)  Vi 
j=i 

m  — 1 

X]  yij  -  C  ~  ‘rnn{m  -  1)  Vj 

i=l 

m  —  l  n—1  m  — 1 

X  X  yij  -  E  -  Cn  -  mn{n  -  l)(m  -  1) 

?  =  1  j  =  l  «  =  1 

Vij  >  -mn. 

Now  if  Ti  >  2mn^  for  each  r,  and  Cj  >  2nm^  for  each  Cj,  P  satisfies  the  theorem’s 
requirements.  We  can  relax  these  conditions  slightly,  namely  it  suffices  to  have  r,  > 
2mn^/V'log  mn  and  Cj  >  2nm? / -^/log  mn  for  each  r,  and  each  Cj  respectively. 


5.3.2  Integral  flows 

In  this  section  we  consider  the  problem  of  sampling  (nearly)  uniformly  from  the 
set  of  integral  s  —  t  flows  in  a  network.  Although  our  approach  could  be  used  for 
directed  graphs,  for  simplicity  we  assume  here  that  we  are  given  an  undirected  graph 
G  =  (V,  E)  with  capacities  on  the  edges  C  :  E  R^,  and  two  distinguished  vertices 
s,t  called  the  source  and  the  sink  respectively.  An  integral  flow  is  an  assignment 
of  integers  to  the  edges  corresponding  to  a  flow  from  s  to  T  Such  an  assignment  must 
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respect  the  capacity  constraints,  namely  the  flow  through  an  edge  must  be  less  than 
the  capacity,  and  the  conservation  constraints,  namely  the  net  flow  at  a  vertex  must 
be  zero.  To  set  up  the  problem  as  a  polytope,  we  make  two  modifications  to  G.  First 
we  add  the  edge  {t^s)  to  the  graph  (if  it  is  not  present)  so  that  we  can  enforce  the 
conservation  constraints  at  all  vertices  (i.e.  including  s  and  t).  Then  we  arbitrarily 
direct  every  edge  so  that  E  is  now  a  set  of  directed  edges.  The  resulting  polytope  of 
feasible  solutions,  P,  has  the  following  constraints: 

-C{i,j)  <  Xij  <  V(i,i)  e  E 

^ij  ~  ^ji  €  V. 

This  polytope  is  not  full-dimensional  in  It  will  be  more  convenient  to  work 

with  a  full-dimensional  polytope,  so  we  apply  some  further  transformations.  Choose  a 
spanning  tree  T  of  G  (such  a  spanning  tree  exists  because  in  order  to  have  a  non-zero 
flow  we  can  assume  that  s  and  t  are  in  the  same  connected  component;  any  component 
that  does  not  contain  s  or  t  is  irrelevant  and  can  be  deleted).  For  simplicity,  we  can 
assume  that  T  is  an  arborescence,  rooted  at  s  (say),  by  directing  the  edges  of  T  first, 
when  we  chose  directions,  and  then  the  rest  of  the  edges.  Number  the  vertices  in 
postfix  order  along  T,  so  that  each  vertex  gets  a  higher  number  than  any  descendant 
of  it.  Consider  the  conservation  constraint  for  a  leaf  vertex  ^,  let  {k,i)  G  T  be  the 
leaf  edge. 

Xki  y  ]  Xij  y  ]  Xji- 

j-LS)^E\T  j:(j,i)€E\T 

By  substituting  these  expressions  for  leaf  edges  in  the  conservation  constraints  for 
next-to-leaf  vertices,  and  recursing,  we  get  the  following  equations,  one  for  each  tree 
edge  {k,  /): 


=  E  (  E  E  V(i-,0er 

Here  S{1)  denotes  the  set  of  vertices  in  the  subtree  rooted  at  I  in  T.  So  the 
polytope  P  can  be  reformulated  in  space  as 


-C{i,j)<Xi,<C{i,3)  y{i,j)eE\T 


i€S(J)  \i:ti.,)eE\T 


Xij 


E  xj,\<c{k,i)  y{k,i)eT. 

r.{3,i)eE\T 
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Since  each  constraint  trivially  has  at  most  E  variables,  if  each  capacity  is  at  least 

3 

\E\^  then  we  can  apply  the  theorem  to  sample  the  integer  points  of  P  nearly  uniformly 
(and  thus  also  count  the  number  of  integral  flows).  Note  that  a  simple  modification 
will  allow  us  to  sample  the  number  of  flows  of  a  specified  value  f. 

5.3.3  Hardness  of  counting  flows 

To  contrast  with  the  above  result,  we  recount  the  folklore  result  that  it  is  NP-hard  to 
count  approximately  the  number  of  flows.  We  do  this  by  reducing  the  NP-complete 
problem  of  deciding  whether  a  graph  has  a  Hamilton  cycle  to  this  approximate  count¬ 
ing  problem. 

Theorem  16  It  is  NP-hard  to  count  the  number  of  s  —  t  simple  paths  in  a  graph  with 
n  vertices  to  any  poly(n)  factor. 

Proof.  Let  be  two  numbers  which  we  will  fix  later.  We  replace  each 

edge  of  G  by  Ik  edges:  first  divide  the  edge  into  /  edges  by  introducing  /  —  1  new 
vertices,  and  then  replace  each  resulting  edge  with  k  parallel  edges. 

The  number  of  cycles  in  the  new  graph  G'  will  tell  us  if  the  original  graph  G  is 
hamiltonian.  If  G  is  Hamiltonian  then  G'  has  at  least  =  A:'”  cycles  (correspond¬ 
ing  to  a  single  Hamilton  cycle).  On  the  other  hand,  if  G  has  no  cycles  of  length  n,  an 
upper  bound  on  the  number  of  cycles  in  G'  is  because  there  are  fewer  than 

2"n!  cycles  in  G  and  each  has  at  most  representatives  in  G' .  By  a  suitable 

choice  of  A:,  /  with  each  only  0(n),  we  can  make  the  first  number  higher  than  any 
fixed  poly{n)  times  the  second  number.  Thus  counting  the  number  of  cycles  in  G'  up 
to  any  poly{n)  factor  would  let  us  decide  if  G  is  Hamiltonian. 

Finally,  the  number  of  cycles  involving  a  particular  edge  (u,  u)  is  just  the  number 
of  paths/flows  from  u  to  u  in  the  graph  with  the  edge  deleted  and  all  capacities  set 
to  1.  □ 

5.3.4  6-matchings 

Given  an  undirected  graph,  G  =  (V,E),  and  a  function  b  :  V  Z"*",  a  b-matching 
is  an  assignment  of  positive  integers  to  edges,  x  :  E  -¥  Z'^ .,  so  that  the  sum  of  the 
weights  on  edges  incident  at  a  vertex  v  is  at  most  h{v).  A  6-matching  is  perfect  if  it 
has  a  weight  of  exactly  b{v)  at  every  vertex.  A  perfect  6-matching  can  be  found  in 
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polynomial  time  (if  one  exists)  using  the  ellipsoid  algorithm  [34].  Here  we  consider 
the  problem  of  sampling  from  the  set  of  6-matchings  uniformly  at  random. 

Let  P  be  the  polytope  defined  by  the  following  constraints. 

x{e)  >0  \/e  G  E 
x{5{t’))  <  b{v)  'iv  GV 

In  the  second  set  of  constraints,  6(t>)  represents  the  edges  incident  to  v  and  a;(6(n)) 
is  the  sum  of  the  weights  on  these  edges.  Any  integer  solution  satisfying  the  above 
constraints  is  a  valid  6-matching  of  G. 

Let  \E\  =  m,  and  d[v)  be  the  degree  of  a  vertex  v.  To  apply  the  theorem,  we  use 
the  following  substitution, 

y{e)  =  x{e)  —  m. 

Then  in  y-space,  we  have  the  following  polytope 

y(e)  >  -m 

y{5{v))  <  b{v)  —  md{v). 

Now  for  any  6  such  that  b{v)  >  2md{v)  for  all  v  G  V,  and  in  particular  for  6  such 
that  each  b{v)  >  2mn,  we  can  sample  6-matchings  of  G  nearly  uniformly.  This  can  be 
extended  to  sampling  capacitated  6-matchings,  where  there  are  additional  constraints 
on  the  maximum  weights  assigned  to  edges  [34]. 


5.4  Tight  examples 

In  this  section  we  give  examples  to  show  that  our  bound  on  the  components  of  6  is 
tight  (up  to  a  ydog  r  factor).  Consider  the  simplex  in  n  dimensions, 


n 

P  =  {x  G  R"  ■  '^Xi  <  b,  Xi  >  0  Vz} 

i=l 


To  derive  an  upper  bound,  we  can  reformulate  P  (thereby  centering  it  around  the 
origin).  We  substitute. 


yi  =  Xi-n. 


Then  in  y-space,  P  becomes. 
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<b  -  rP,  Vi  >  -n  Vi. 


So  now  for  6  >  +  7iy/n,  P  satisfies  the  conditions  of  the  theorem. 

Let  us  examine  what  happens  if  b  is  slightly  smaller.  The  volume  of  the  simplex 
is  6”/n!  and  the  number  of  lattice  points  in  P,  i.e.,  the  number  of  ways  of  dividing  b 
.  .  (  b  +  n-l  \  . . .  . 


into  n  or  fewer  parts  is 
to  the  volume  is 


which  is  the  same  as 


n  —  I 


.  So  the  ratio  of  the  number  of  lattice  points 


n{b  +  n  —  1)! 
¥b\ 


^  “  lx/,  n  —  2.  _  1, 

^(i  +  — )(i  +  — Mi  +  j). 


The  latter  is  lower  bounded  by 


n  .  n  —  l.n 


For  any  c  such  that  b  <  crP,  this  is  exponential  mile  (the  ratio  is  about  e'),  showing 
that  the  volume  is  no  longer  a  good  approximation  to  the  number  of  lattice  points. 


5.5  Conclusion  and  open  problems 

Is  the  dependence  of  our  main  theorem  on  m,  the  number  of  facets,  necessary?  It 
would  be  nice  to  get  rid  of  it. 

Although  we  have  presented  a  (nearly)  tight  condition  for  sampling  lattice  points, 
it  is  possible  that  a  different  point  of  view  would  yield  more  general  algorithms.  For 
example,  for  lattice  point  based  random  walks,  what  are  some  simple,  easy-to-verify 
conditions  that  guarantee  rapid-mixing? 
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